The type system machine code wants

Dec 9, 2019

[Update 2020-01-10: This post is now out of date with the design of Mu's type system.]

The design of the Mu computer recently got a lot clearer in my head, after I went back and carefully annotated 45k lines of machine code with types. These types are just code comments, of course; I don't get any type-checking in machine code, whether “ahead of time” or at run-time. However, it took me nearly a year of programming to notice that there still is a type system in my head, and it's not quite what I'm used to when programming in Python or even C. Laying it out explicitly has made it a good candidate for Mu's level-2 language, which will be strongly typed and memory-safe, but still mostly map 1:1 to machine code.

Why does Mu's type system need something more than C's? Since I want to translate 1:1 to machine code, and since machine code is constrained in what registers can do, Mu variables are explicitly allocated to either a register or memory. This decision creates two cascading constraints:

You can store word-sized values in registers, but nothing larger.
Anything you store in memory can only be reached through an addressing mode.

As a result, what we think of as 'variables' bifurcate into two categories.

A word-size variable in a register is the only thing you can refer to cleanly 'by value':

var x/eax : int

(Read this syntax as "x is an int in register eax." See this previous post for details.)

A larger variable in memory looks kinda like this:

var x : (ref point) # say 'point' is an (x, y) co-ordinate

This declaration implies two facts:

`x` has type "address to point". Getting at its value requires a memory lookup (or a pointer dereference, in C parlance).
`x` has been allocated enough space for a point. The storage is tightly coupled with the variable. Like a C++ reference, `x` can't ever be bound to other storage for the rest of its lifetime.

Here's a larger variable in memory, with its address in a register:

var x/ecx : (ref point)

A word-size variable in memory, with its address in a register:

var x/ecx : (ref int)

Hopefully it's obvious by now that non-refs in memory are meaningless:

var x : int # you need to get at x's value using a dereference,
            # so it has to be some sort of address

Working with `ref`s

We all know that working with C's pointers is error-prone and the cause of many security issues. Mu is even more low-level, and what would be value types in other languages turn into addresses. How can we keep all these extra addresses straight in our heads?

The key is to be very deliberate about copying. C lets you copy variables of any type with the same assignment operator, implicitly triggering block copies as necessary. Block copies can further involve copying other addresses around. Mu is more explicit. To trigger a block copy you have to copy one `ref` into another `ref`.
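To make the C side of this contrast concrete, here is a minimal sketch. The `shape` struct and the `copy_demo` function are hypothetical names made up for illustration; nothing here is Mu code.

```c
#include <assert.h>

/* Hypothetical types for illustration only. */
struct point { int x; int y; };
struct shape { struct point origin; int *tags; };

int copy_demo(void) {
    int tags[2] = {7, 8};
    struct shape a = { {1, 2}, tags };
    struct shape b;

    b = a;            /* one '=' silently copies the whole struct */
    b.origin.x = 10;  /* b has its own copy of the point... */
    b.tags[0] = 70;   /* ...but the copied address still aliases a's tags */

    assert(a.origin.x == 1);  /* a's point was not touched */
    assert(a.tags[0] == 70);  /* a's tags were changed through b */
    return a.tags[0];
}
```

The same `=` that copies an `int` here also copies the eight-byte point and an address along with it, which is exactly the kind of implicit behavior the paragraph above is wary of.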

This means `ref`s have none of the benefits of addresses or pointers. How can we introduce aliasing when we need it? One common use of aliasing is to avoid expensive copies when passing large objects between functions. Mu introduces a second type for this: `address`.

fn foo n : (ref point) { # n is passed by value
fn foo n : (address point) { # n is passed by reference

Confusing, I know… Alternative suggestions for names (or anything else) most welcome.
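As a rough analogy only, and assuming C's calling conventions rather than Mu's, the two signatures behave something like the two functions below. The names `shift_copy`, `shift_in_place` and `by_value_vs_reference` are hypothetical.

```c
#include <assert.h>

struct point { int x; int y; };

/* Like (ref point): the callee gets its own copy of the whole struct. */
void shift_copy(struct point p) { p.x += 1; }

/* Like (address point): the callee gets only the address of the caller's struct. */
void shift_in_place(struct point *p) { p->x += 1; }

int by_value_vs_reference(void) {
    struct point a = {1, 2};

    shift_copy(a);       /* modifies the callee's copy only */
    assert(a.x == 1);

    shift_in_place(&a);  /* modifies the caller's storage */
    assert(a.x == 2);

    return a.x;
}
```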

An `address` is intended to be a short-lived entity. It can't be stored on the heap. Structs can't contain `address` members. These constraints permit the guarantee that while you have an `address`, the data it points to isn't going anywhere. If that data is on the stack, it's in a stack frame below you, which is guaranteed to outlive you. If it's in the global segment, it's eternal anyway (what Rust calls the `'static` lifetime).
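C enforces none of this, but the two safe cases can be sketched in it. This is an illustration under that assumption, with hypothetical names (`origin`, `sum`, `outer`), not a description of how Mu checks anything.

```c
#include <assert.h>

struct point { int x; int y; };

/* Static storage: lives for the whole run of the program. */
static struct point origin = {0, 0};

/* The callee only borrows the address for the duration of the call. */
int sum(const struct point *p) { return p->x + p->y; }

int outer(void) {
    struct point local = {3, 4};  /* lives in outer's stack frame */

    /* Both addresses stay valid while sum runs: outer's frame outlives
       the call, and origin is never deallocated. */
    return sum(&local) + sum(&origin);
}
```

What the rules above rule out are the other directions: returning `&local` from `outer`, or stashing it in a heap-allocated struct that might outlive the frame.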

(Allocations on the heap don't get `ref`s, they get handles. Handles can also turn into addresses, though with a runtime check that can crash the program. Handles will get their own article at some point, once I feel confident I've worked through the details. For now I assume any data that requires aliasing beyond pass-by-reference is allocated on the heap and managed using handles.)

comments

Anonymous, 2019-12-11: > An `address` is intended to be a short-lived entity. It can't be stored on the heap. Structs can't contain 'address' members. These constraints permit the guarantee that while you have it, it's not going anywhere.

Reminds me of C#'s `Span`...

> Span is a ref struct that is allocated on the stack rather than on the managed heap. Ref struct types have a number of restrictions to ensure that they cannot be promoted to the managed heap, including that they can't be boxed, they can't be assigned to variables of type Object, dynamic or to any interface type, they can't be fields in a reference type...

Kartik Agaram, 2019-12-11: Thank you!

Anton Dyudin, 2019-12-14: Personally I'd go for `own point` for the version you copy around by value, and `ref point` for the region types you're passing by reference. (Leading to `box point` for handles on the heap: mostly drawing on rust here)

Comments gratefully appreciated. Please send them to me by any method of your choice and I'll include them here.