λm.me - Typing extensible records in Giml

Typing extensible records in Giml

- 2021-04-10 -

In the last blog post we covered the general structure and algorithms of Giml's type inference engine.

In this blog post we'll take a closer look at extensible records in Giml and how to infer their types.

Records

A record is a collection of values, each associated with a name (also called a label). For example, { name = "Giml", age = 0 }. The type of records is very similar to their structure for example the type of the record above is: { age : Int, name : String }.

Records are used to build compound data aggregating multiple values into one value that contains all of the information. The different values (also called fields) inside the record can be accessed (or selected) by their labels, like this: <record>.<label>.

Records can also be extended with additional fields:

let
    lang = { name = "Giml", age = 0 }
in
    { website = "https://giml-lang.org" | lang }

The result of the expression is a new record that has all of the fields lang has and has one extra field website as well. Note that in Giml, we can't have multiple values with the same label. When we add a new field with an existing label, that previous one is replaced with the new value.

There are other reasonable behaviours here such as disallowing such operation or having a scope of labels, but this is the behaviour I chose for Giml :)

Records are first class and can be used anywhere an Int can be expected. They can also be pattern matched:

case { name = "Giml" } of
    | { name = "Giml" } ->
        "Yeah I heard about it, funny name."

    | { name = "Haskell" } ->
        "Haskell is pretty cool!"

    | other ->
        concat (concat "I'm sure " other.name) " is pretty good too!"
end

So far we've seen record literals and three operations on records: field selection, extension and pattern matching.

When we approach to type inference of features we need to take into consideration a few things:

What the types of the feature look like
What to do for each operation (elaboration and constraints generation)
How to unify the types (constraint solving)
How to instantiate
How to substitute

We are going to first try a naive approach: Just represent the types of record as TypeRec [(Label, Type)] and see how it's just not enough to make this feature useful without type annotations.

Let see how we elaborate each operation of Record literals, field selection, record extension, pattern matching:

Record literals: simple, we have an expression that looks like this: { label1 = expr1, label2 = expr2, ... }, so we elaborate each expression and annotate the node with the type TypeRec [(label1, t1), ...].

Record selection: Also simple, we have something that looks like expr.label, so we elaborate the type of the expression and constrain it as equals to TypeRec [(label, t)].

But hold up, if we try to work with this scheme we'll see that it is not good enough. Trying to typecheck this expression { a = 1, b = 1 }.a will create this constraint: Equality { a : Int, b : Int } { a : t }, and while they both have a field a, one of them is missing a field! So they are definitely not equal.

So we need to go at this differently and there are several approaches, one is to create a new type of constraint that describe the relationship of "subtyping", another is to describe the types differently.

For Giml I chose the second approach, we add extra information to the types to encode that "this record type may have more fields". This approach is called row polymorphism.

Row Polymorphism

Row polymorphism provides us with the ability to represent extra information in an additional type variable (called a row type variable) in the type.

So in addition to the variant we created TypeRec [(Label, Type)], we add another one: TypeRecExt [(Label, Type)] TypeVar. Syntactically the type will look like this:

{ <field1> : <type1>, ... | <extension-type-variable> }

What this extra type variable does is represent additional fields that may be part of the type that we don't know yet.

Let's use this for record field selection, instead of saying "this type is equals to a record with one field of type t", we can now say "this type is equals to a record that has this field with this type, and other fields represented by this type variable".

With this scheme we can move forward. Let's try again:

Elaboration and constraint generation

Record literals

same as before

Record selection

We have something that looks like expr.label, so we:

Generate a type variable representing the return type (which is the field type), let's call this t
Elaborate the type of the expression
Generate a type variable representing the rest of the fields, let's call this ext
Constrain the type of the expression as equals to { label : t | ext }.

Now when we try to typecheck the expression { a = 1, b = 1 }.a, which generates this constraint: Equality { a : Int, b : Int } { a : t | ext }, we can match the fields and types that we know (a : Int with a : t), and also match the fields that we don't ({ b : Int } with ext) - more on that later.

And when we find out what the real type of ext is, we'll substitute it (we'll talk about that later as well).

Record extension

We have something that looks like { <field1> = <expr1>, ... | <expression> }, so we:

Elaborate the type of each field
Elaborate the type of the expression
Generate a type variable for the expression which we will call ext
Constrain the type of the expression to be equal to { | ext }
Annotate the node with the type { <field1> : <type1>, ... | ext }

Pattern matching

Pattern matching is similar to record literals case, but instead of matching the expression in the case with a rigid TypeRec, we generate a type variable and match with TypeRecExt. That way we can select less fields in the pattern than might be available in the expression type.

Constraint solving

We need to handle 3 new cases - equality between:

Two TypeRec
A TypeRec and a TypeRecExt
Two TypeRecExt

Two TypeRec

The two records should have the same labels and for each label we generate a new constraint of Equality between the types for the label on each side.

So for example { a : Int, b : String } and { a : t1, b : t2 }, we generate the two constraint: Equality Int t1 and Equality String t2.

If one TypeRec has a field that the other do not have, we throw an error of a missing field.

A TypeRec and a TypeRecExt

Each field in the TypeRecExt should match with the matching field in TypeRec. If there's a field in TypeRecExt that does not exist in TypeRec we throw an error.

The other side is different, all the missing field in TypeRecExt that exist in TypeRec will be matched with the row type variable of TypeRecExt. Remember - we said that with row polymorphism we use a type variable as a representative of fields we don't know of yet! So we treat the type variable in TypeRecExt as if it is a TypeVar that matches the fields that exist in the TypeRec but not in the TypeRecExt.

So for example in Equality { a : Int, b : String } { a : t1 | ext } we generate 2 new constraints: Equality Int t1 and Equality { b : String } ext.

Two TypeRecExt

This scenario is slightly trickier, of course - the specified fields in both sides should be matched just like in previous cases, but what do we do with the missing fields from each side?

Let's check a concrete example, we'll make it simpler by not including matching fields.

Equality { a : Int | t1 } { b : Int | t2 }

What this constraint mean, is that t2 is equals to { a : Int | t1 } without { b : Int }, and t1 is equals to { b : Int | t2 } without { a : Int }.

Since we don't have any notion of subtracting a type from a type (some other type system do support this), we can try and represent this differently, we could represent this subset as a new type variable, and translate the Equality above to these two constraints:

Equality { a : Int | t' } t2
Equality { b : Int | t' } t1

Now t' represents t2 without { a : Int } and t' also represents t1 without { b : Int }.

So more generally, when we try to unify two TypeRecExt, we match the matching fields and also create a new row type variable to represents all of the unspecified fields in both types, and match each row type variable with the fields specified on the other side.

Let's describe this one more time in psuedo code, adding new constraints:

Equality { <fields-only-found-on-the-left-side>  | new-type-var } extRight
Equality { <fields-only-found-on-the-right-side> | new-type-var } extLeft
<matched-fields>

Instantiation

Instantiation occurs as usual, the row type variable in TypeRecExt should be instantiated as well.

Substitution

What do we do if the row type variable in TypeRecExt appears in the substitution?

One, if the row type variable is mapped to a different type variable, we just replace it.

Two, if it's mapped to a TypeRec, we merge it with the existing fields and return a TypeRec with all fields, but it's important to note that some fields in the TypeRec from the substitution may be the same as ones from our TypeRecExt.

There are a few approaches one could take here, one is to keep a scope of labels, another is to try and unify the types of the field.

In Giml we take the type of the left-most instance, discarding the right one. So semantically this expression is legal: { a = "Hi!" | { a = 1} } and its type is { a : String }.

Three, if the row type variable is mapped to a TypeRecExt, we do the same thing as in Two, but return a TypeRecExt instad of a TypeRec with the row type variable from the mapped value as our new row type variable.

This is basically what we need to do to infer the type of extensible records.

Example

Let's see how to infer the type of a simple record field access example. This process is a bit much, but I've included it for those who want a live demonstration of how this works. Feel free to skip it!

giml = { name = "Giml", age = 0 }

getName record = record.name

gimlName = getName giml

The first stage of inference is to group dependencies and order them topologically. Since we don't have any mutual dependencies, each definition is standalone and the order of dependencies is kept (the last definition uses the previous two).

Also note that the names of type variables do not matter to the algorithm. They are a bit different so I have better time knowing where they are introduced.

First group

giml = { name = "Giml", age = 0 }

Elaboration

Through elaboration, we end up with the following AST and constraints:

-- Ast:

 +-- { age : Int, name : String }
 |               |
 |               |
 |     __________|_______________
giml = { name = "Giml", age = 0 }


-- Constraints:

[ Equality top0 {age : Int, name : String}
]

Constraint solving

We go over the first group and solve the constraints:

-- Constraints: [Equality top0 {age : Int, name : String}]
-- Substitution: []

1. Equality top0 {age : Int, name : String}
     -- => add `top0 := {age : Int, name : String}` to the substitution


-- Constraints: []
-- Substitution: [top0 := {age : Int, name : String}]

From this process we learn that the definition giml has the type { name = "Giml", age = 0 }. We add this to the type environment (mapping from definition names to types) and use it from the next groups.

Second group

getName record = record.name

Elaboration

Through elaboration, we end up with the following AST and constraints:

-- Ast:

   +--- targ4 -> t5
   |
   |               +--- targ4
   |               |
   |             __|___
getName record = record.name
                 -----------
                      |
                      +------- t5


-- Constraints:

[ Equality targ4 {name : t5 | t6}
, Equality tfun3 (targ4 -> t5)
, Equality top1 (targ4 -> t5)
]

Constraint solving

-- Constraints:
     [ Equality targ4 {name : t5 | t6}
     , Equality tfun3 (targ4 -> t5)
     , Equality top1 (targ4 -> t5)
     ]
-- Substitution:
     []


1. Equality targ4 {name : t5 | t6}
    -- => add `targ4 := {name : t5 | t6}` to the substitution and apply

-- Constraints:
     [Equality tfun3 ({name : t5 | t6} -> t5), Equality top1 ({name : t5 | t6} -> t5)]
-- Substitution:
     [targ4 := {name : t5 | t6}]


2. Equality tfun3 ({name : t5 | t6} -> t5)
     -- => add `tfun3 := ({name : t5 | t6} -> t5)` to the substitution and apply

-- Constraints:
     [Equality top1 ({name : t5 | t6} -> t5)]
-- Substitution:
     [ targ4 := {name : t5 | t6}
     , tfun3 := ({name : t5 | t6} -> t5)
     ]

3. Equality top1 ({name : t5 | t6} -> t5)
     -- => add `top1 := ({name : t5 | t6} -> t5)` and apply

-- Constraints:
     []
-- Substitution:
     [ targ4 := {name : t5 | t6}
     , tfun3 := ({name : t5 | t6} -> t5)
     , top1 := ({name : t5 | t6} -> t5)
     ]

Substitute

We apply the substitution to our elaborated AST:

   +--- {name : t5 | t6} -> t5
   |
   |               +--- {name : t5 | t6}
   |               |
   |             __|___
getName record = record.name
                 -----------
                      |
                      +------- t5

Generalize

We then generalize the type by finding all of the free variables, giving them nicer names, and wrap the type of the top-level definition in a forall:

We find t5 which we will name a, and t6 which we will name b:

   +--- forall a b. {name : a | b} -> a
   |
   |               +--- {name : a | b}
   |               |
   |             __|___
getName record = record.name
                 -----------
                      |
                      +------- a

We add this definition to the type environment.

Third group

gimlName = getName giml

Elaboration

Through elaboration, we end up with the following AST and constraints:

-- Ast:

   +--- t9
   |
   |
gimlName = getName giml
              |     |
              |     |
              |     +---- top0_i8
              |
              +---- top1_i7

-- Constraints:

[ Equality top1_i7 (top0_i8 -> t9)
, Equality top2 t9
, Equality top0_i8 {name = "Giml", age = 0}
, Equality top1_i7 (forall a b. {name : a | b} -> a)
]

Constraint solving

-- Constraints:
     [ Equality top1_i7 (top0_i8 -> t9)
     , Equality top2 t9
     , Equality top0_i8 {name = "Giml", age = 0}
     , Equality top1_i7 (forall a b. {name : a | b} -> a)
     ]
-- Substitution:
     []

1. Equality top1_i7 (top0_i8 -> t9)
     -- => add `top1_i7 := (top0_i8 -> t9)`

-- Constraints:
     [ Equality top2 t9
     , Equality top0_i8 {name = "Giml", age = 0}
     , Equality top1_i7 (forall a b. {name : a | b} -> a)
     ]
-- Substitution:
     [ top1_i7 := (top0_i8 -> t9)
     ]

 2. Equality top2 t9
    -- add `top2 := t9` and apply

-- Constraints:
     [ Equality top0_i8 {name = "Giml", age = 0}
     , Equality top1_i7 (forall a b. {name : a | b} -> a)
     ]
-- Substitution:
     [ top1_i7 := (top0_i8 -> t9)
     , top2 := t9
     ]

 3. Equality top0_i8 {name = "Giml", age = 0}
      -- => we add `top0_i8 := {age : Int, name : String}`


-- Constraints:
     [ Equality ({age : Int, name : String} -> t9) (forall a b. {name : a | b} -> a)
     ]
-- Substitution:
     [ top1_i7 := ({age : Int, name : String} -> t9)
     , top2 := t9
     , top0_i8 := {age : Int, name : String}
     ]

4. Equality ({age : Int, name : String} -> t9) (forall a b. {name : a | b} -> a)
     -- => new constraint:
     -- `Equality ({age : Int, name : String} -> t9) ({name : ti10 | ti11} -> ti10)`

-- Constraints:
     [ Equality ({age : Int, name : String} -> t9) ({name : ti10 | ti11} -> ti10)
     ]
-- Substitution:
     unchanged

5. Equality ({age : Int, name : String} -> t9) ({name : ti10 | ti11} -> ti10)
     -- => add constraints `Equality {age : Int, name : String} {name : ti10 | ti11}`
     --    and `Equality t9 ti10`

-- Constraints:
     [ Equality {age : Int, name : String} {name : ti10 | ti11}
     , Equality t9 ti10
     ]
-- Substitution:
     unchanged

6. Equality {age : Int, name : String} {name : ti10 | ti11}
     -- => This is a TypeRec & TypeRecExt scenario. we match the shared fields
     --    and match the row type variable with the missing fields from the typeRec
     --    adding `Equality String ti10` and `Equality ti11 { age : Int }`


-- Constraints:
     [ Equality String ti10
     , Equality ti11 { age : Int }
     , Equality t9 ti10
     ]
-- Substitution:
     [ top1_i7 := ({age : Int, name : String} -> t9)
     , top2 := t9
     , top0_i8 := {age : Int, name : String}
     ]

7. Equality String ti10
   -- => ti10 := String

-- Constraints:
     [ Equality ti11 { age : Int }
     , Equality t9 String
     ]
-- Substitution:
     [ top1_i7 := ({age : Int, name : String} -> t9)
     , top2 := t9
     , top0_i8 := {age : Int, name : String}
     , ti10 := String
     ]

7. Equality ti11 { age : Int }
   -- => ti11 := { age : Int }

-- Constraints:
     [ Equality t9 String
     ]
-- Substitution:
     [ top1_i7 := ({age : Int, name : String} -> t9)
     , top2 := t9
     , top0_i8 := {age : Int, name : String}
     , ti10 := String
     , ti11 := { age : Int }
     ]

8. Equality t9 String
   -- =>  t9 := String

-- Constraints: []
-- Substitution:
     [ top1_i7 := ({age : Int, name : String} -> String)
     , top2 := String
     , top0_i8 := {age : Int, name : String}
     , ti10 := String
     , ti11 := { age : Int }
     , t9 := String
     ]

Phew, we made it! Let's apply the substitution back to our AST and see the result!

   +--- String
   |
   |
gimlName = getName giml
              |     |
              |     |
              |     +---- {age : Int, name : String}
              |
              +---- {age : Int, name : String} -> String

This look correct! (and yey me who did the constraint solving by hand)

Summary

I found implementing extensible records using row polymorphism to be relatively straightforward. I hope this article makes it more approachable as well.

In the next blog post we'll discuss polymorphic variants and how to infer their types.