Giml's type inference engine

- 2021-04-06 -

Note: this article was last updated on 2022-06-07

Giml's type inference engine uses unification-based constraint solving.

It has 6 important parts:

Topologically order definitions and group those that depend on one another
Elaboration and constraint generation
Constraint solving
Instantiation
Substitution
Generalization

The general process is as follows: we sort and group definitions by their dependencies, we elaborate the program with type variables and types that we know and collect constraints on those variables. We then solve the constraints for each group using unification, instatiating polymorphic types we run into, and create substitutions, which are mapping from type variables to types, and merge them together. Then we substitute the type variables we generated in the elaboration stage with their mapped types from the substitution in the program, and then generalize the types of top-level definitions and close over their free variables.

Lets unpack all of that, step by step.

Group and order definitions

Before we start actually doing type inference, we figure out the dependencies between the various definitions. This is because we want to figure out the type of a definition before its use, and because we want definitions that depend on each other (mutual recursion) to be solved together.

So for example if we have this file as input:

main = ffi("console.log", odd 17)

even x =
    if int_equals x 0
        then True
        else if int_equals x 1
            then False
            else odd (sub x 1)

odd x =
    if int_equals x 0
        then False
        else if int_equals x 1
            then True
            else even (sub x 1)

The output of the reordering and grouping algorithm will be:

[["even", "odd"], ["main"]]

This is because even and odd are mutually recursive, so we want to typecheck them together, and because main depends on odd we want to typecheck it after we finished typechecking odd.

The module implementing this algorithm is Language.Giml.Rewrites.PreInfer.GroupDefsByDeps.

Elaboration

Now the real work begins. We start by going over the program's AST and annotate it with the types we know. When we don't know what the type of something is we "invent" a new fresh type variable as a placeholder and mark that type with a constraint according to its usage.

Types

Language.Giml.Types.Types contains a datatype that can be used to represent Giml types. For example:

TypeCon "Int" represents the type Int
TypeVar "t1" represents a polymorphic type variable
TypeApp (TypeCon "Option") (TypeCon "Int") represents the type Option Int
TypeRec [("x", TypeCon "Int")] represents a record with the type { x : Int }

Constraints

A Constraint between two types describes their relationship.

Currently we only have one type of constraint: Equality. Which means that the two types are interchangeable and should unify to the same type. For example from the constraint Equality (TypeVar t1) (TypeCon "Int") we learn that t1 is the same as Int.

One instance of an Equality constraints stands out from the rest - when the second argument is a TypeScheme. In that case we mean to say that the first type of the Equality has an Equality relationship with the instantiation of the second type.

An instantiation is a monomorphic instance of a polymorphic type, for example forall a b. (a -> b) -> List a -> List b is polymorphic, and can be instantiated, among other examples, to (Int -> String) -> List Int -> List String by replacing the polymorphic type variables a and b which are declared in the forall clause with monomorphic types such as Int and String.

We will talk about solving these constraints a bit later.

Implementation

We also use State, Reader and Except in our algorithm:

The State contains:
- A monotonically increasing integer for generating type variables
- A list of constraints we generated through the elaboration phase, which we will return as output at the end
The Reader contains:
- The currently scoped variables in the environment and their type
- The type of builtin functions and expressions (such as int_equals and pure)
- An environment of data definitions, some are built-in and others are user defined and added as part of the elaboration stage
The Except part can throw a few errors such as "unbound variable" error.

Most of the action happens in elaborateExpr. I'm not going to cover every usecase, but lets look at a few examples:

Elaborating literals

Elaborating literals is simple - we already know what the types are - if we see a string literal we annotate the AST node with the type String, if we see an integer literal we annotate it with Int, and we don't need to generate new constraints because we know exactly what the type is.

Elaborating lambda functions

Elaborating lambda expressions is interesting, because we don't know what type each argument is going to be: it might be inferred by the usage like in this case: \n -> add 1 n or not like in this case \x y -> x. What we do here is:

Generate a type variable for each function argument
Elaborate the body of the lambda with the extended environment containing the mapping between the function arguments and their newly generated types
Annotate the lambda AST node with the type of a function from the argument types to the type of the body, which we get from the elaborated node after (2)

And we will trust the constraint solving algorithm to unify all of the types.

Elaborating function application

To elaborate a function application we:

Elaborate the arguments
Elaborate the applied expression
Generate a fresh type variable to represent the return type
Constrain the type of the applied expression as a function from the arguments to the return type
Annotate the function application AST node with the return type (which we just generated)

Remember that the type of the applied expression may just be a generated type variable, so we can't examine its type to see what the return type is. For example if we had this expression \f x -> f x, as we saw before we generated type variables for f and x! So we have to make sure the type inference engine knows what their type should be (by generating a constraint) according to the usage and trust that it'll do its job.

Elaborating variables

For variables, we expect an earlier stage (such as elaborating lambdas, lets or definitions) to have already elaborated their type and added them to the environment, so we look them up. We first try to look them up in the scoped environment and if that fails we try the built-ins environment. If that fails as well, we throw an error that the variable is unbound.

If we do find a variable in the scoped environment, we generate a fresh variable and add constrain the type of the fresh variable to be equals to the type we found in the env.

Generating a fresh variable is important, because if the type we found in the environment is polymorphic (i.e. it is a TypeScheme), the constraint solving phase will instantiate the polymorphic type. We'll talk more about what this means a bit later.

Constraint solving

The heart of the type inference engine. Here we visit each the constraints generated from each group of substitutions separately and generate a substitution (mapping from type variables to types) as output.

the solveConstraints function takes an accumulated substitution and a list of constraints as input and tries to solve the first constraint on the list. Solving a constraint might generate more constraints to solve, and may return a substitution as well. We then apply this substitution (substitute all occurences of the keys with the values) to the rest of the constraints and the accumulated substitution itself, merge the new substitution with the accumulated substitution, and keep going until there are no more constraints left to solve.

(solveConstraint) uses a process called unification to solve a constraint. The unification process tries to match two types and produce additional clues to what concrete types they represent or fail when the types do not match.

Here are a few important cases that it handles:

`Equality t1 t2 (where t1 == t2)`

When the two types are the same (for example Int and Int, or t1 and t1), the unification succeeds without extra work needed, we don't need to create new constraints or create a substitution.

`Equality (TypeVar tv) t2`

When one of the types is a type variable, we substitute (replace) it with t2 in all places we know of (accumulated substitution and rest of the constraints). So the rest of the constraints that contain it will refer to t2 instead, and when substituting it with another type it'll be t2.

`Equality (TypeApp f1 a1) (TypeApp f2 a2)`

We generate two new constraints: Equality f1 f2 and Equality a1 a2, reducing a constraint of complex types a step further until eventually we get something we already know how to handle: a constraint on simple types.

`Equality t1 (TypeScheme vars t2)`

When the second argument of Equality is a TypeScheme, we instantiate the second type t2 and produce an Equality constraint between t1 and the instantiation of t2.

Instantiating is basically "copying" the type signature with freshly generated type variables, and producing the Equality constraint with t1 to make the type inference engine find the right type for each of the instantiated type variables.

There are a few more cases, but this is the gist of it.

When we fail to unify we throw an error of type mismatch.

Instantiation

When we run into a type scheme (a type with a forall) during the constraint solving stage, we instantiate the type by looking up all of the type variables in the type and generate a fresh new type variable for each one (though all instances of a type variable should be mapped to the same new type variable).

So for example in the type forall a b. (a -> b) -> List a -> List b, if t1 and t2 are two new type variables we haven't seen before, we can replace a with t1 and b with t2 everywhere in the type and get (t1 -> t2) -> List t1 -> List t2

After that, we let the Equality constraints and the constraint solving algorithm find the right monomorphic types for t1 and t2.

Substitution

As part of the constraint solving process, we generate substitutions when we unify type variables. We need to use the generated substitution immediately by applying it over our accumulated substitution and remaining constraints.

When we apply a substitution to a constraint or another substitution, we replace all occurences of a type variable tv with t if t := tv is in the substitution. But, we also need to check that tv does not contain t inside of it! (this is called an occurs check)

if tv does contain t, it means that the we ran into an infinite type and we don't support those here, so we will throw an error.

A simple way to generate such case is this:

x y = x

The constraint generated from this definition is: Equality top0 (t1 -> top0), which will fail the occurs check.

If you try to desugar this definition into a lambda expression, you'll quickly discover that you've been had: anytime you'll go to replace x with \y -> x you'll just add another level of lambda.

Generalization

Once we applied the substitution to the elaborated types for a group of definitions, we have figured out the types of all definitions in a group. What's left is to declare the polymorphic definitions as polymorphic in a way that when we reference these definitions in subsequent groups we'll know that we need to instantiate them.

Basically, we need to convert the types of top-level definitions to TypeSchemes (or foralls) where the type variables are the leftover type variables. For example, if we have the type t1 -> Int -> t2 -> t1, we find all of the free variables (t1 and t2), replace them in the type signature with nicer types (t1 => a, t2 => b), and write them down at the forall clause of the type (forall a b. a -> Int -> b -> a).

If there are no free variables in the top-level type signature we will generate an empty forall list, so we just omit the forall clause for convenience.

Example time!

Let's see an example for this (very) short program:

one = id 1

id = \x -> x

Topological ordering

We start with topological ordering. We can see that one depends on id, but not the other way around, so we separate them to different groups and id comes first.

[["id"], ["one"]]

First group

The first group is:

id = \x -> x

Elaboration

After the elaboration stage we generate the following elaborated AST:

      +---t1
      |
      |    +---- t2
      |    |
     ---  ---
id = \x -> x
---  -------
 |      |
 |      +----- t1 -> t2
 |
 |
 +--- top0

And constraints:

[ Equality t2 t1
, Equality top0 (t1 -> t2)
]

How did we get these constraints?

We look at the definition of id and generate the type variable top0 for it. We continue to infer the expression to the right with the environment where id := top0.
The lambda expression case generates the type t1 for the argument x.
The lambda expression infers the type of the lambda body, in the environment where id := top0 and x := t1. We look for x in the environment and then generate a new type variable t2 for it as well, and generate an equality constraint: Equality t2 t1.
The lambda expression infers its type as a function from the input type t1 to the output type t2: t1 -> t2.
After infering the type of the body of id, we constraint the type variable generated for the name id with the inferred type of the expression: Equality top0 (t1 -> t2).

Constraint solving

We go over the constraints in order and unify them, generating a substitution when needed and applying it to both the remaining constraints and the accumulating substitution.

- Constraints:
    [ Equality t2 t1
    , Equality top0 (t1 -> t2)
    ]
- Accumulated Substitution:
    [
    ]

1. Equality t2 t1
   => t2 := t1
   -- The first type is a type variable, so we add the
   -- above to the accumulated substitution and apply it

- Constraints:
    [ Equality top0 (t1 -> t1)
    ]
- Accumulated Substitution:
    [ t2 := t1
    ]

2. Equality top0 (t1 -> t1)
   => top0 := t1 -> t1
   -- Same as before

- Constraints:
    [
    ]
- Accumulated Substitution:
    [ t2 := t1
    , top0 := t1 -> t1
    ]

When we run out of constraints, we are done. Lets apply the accumulated substitution to our elaborated AST:

      +---t1
      |
      |    +---- t1
      |    |
     ---  ---
id = \x -> x
---  -------
 |      |
 |      +----- t1 -> t1
 |
 |
 +--- t1 -> t1

Generalization

We take a look of the type of id, find all of the type variables (t1), generate a type variable for each one (a) and substitute the original type variable with the new one, then we wrap the resulting type with a forall and list the new type variables:

id :: forall a. a -> a

Second group

one = id 1

Elaboration

Note: we use the types of the top-level definitions of previous groups when elaborating a group! In our case our environment only contains: id :: forall a. a -> a

We get this elaborated AST:

 +--- top0
 |
 |      +--- t2
 |      |
---   ----
one = id 1
      -- -
      |  |
      |  +--- Int
      |
      +------ t1

And constraints:

[ Equality t1 (forall a. a -> a)
, Equality t1 (Int -> t2)
, Equality top0 t2
]

How did we get these constraints?

For the definition one we generate the type variable top0.
We look at the function application in three steps, the first: looking at the argument, in this case 1 which is a int literal. We annotate it as such.
We then lookup id in our environment, generate the type variable t1, and produce a constraint between the two: Equality t1 (forall a. a -> a)
We generate a type variable t2 to represent the return value, and express our expectation that id's is equals to a function from the type of the argument Int to the return type t2: Equality t1 (Int -> t2)
Finally, we constrain the type of one with the type of the expression: Equality top0 t2

Constraint solving

Let's go over them in order:

- Constraints:
    [ Equality t1 (forall a. a -> a)
    , Equality t1 (Int -> t2)
    , Equality top0 t2
    ]
- Accumulated Substitution:
    [
    ]

1. Equality t1 (forall a. a -> a)
   => [ Equality t1 (ti_1 -> ti_1) ]
   -- A forall! Instantiate and generate a new constraint.

- Constraints:
    [ Equality t1 (ti_1 -> ti_1)
    , Equality t1 (Int -> t2)
    , Equality top0 t2
    ]
- Accumulated Substitution: unchanged

2. Equality t1 (ti_1 -> ti_1)
   => t1 := ti_1 -> ti_1

- Constraints:
    [ Equality (ti_1 -> ti_1) (Int -> t2)
    , Equality top0 t2
    ]
- Accumulated Substitution:
    [ t1 := ti_1 -> ti_1
    ]

3. Equality (ti_1 -> ti_1) (Int -> t2)
   => [ Equality (->) (->), Equality ti_1 Int, Equality ti_1 t2]

- Constraints:
    [ Equality (->) (->)
    , Equality ti_1 Int
    , Equality ti_1 t2
    [ Equality top0 t2
    ]
- Accumulated Substitution: unchanged

4. Equality (->) (->)
   => []
   -- Two type constructors are equal - safe to skip

- Constraints:
    [ Equality ti_1 Int
    , Equality ti_1 t2
    [ Equality top0 t2
    ]
- Accumulated Substitution: unchanged

5. Equality ti_1 Int
   => ti_1 := Int

- Constraints:
    [ Equality Int t2
    [ Equality top0 t2
    ]
- Accumulated Substitution:
    [ t1 := Int -> Int
    , ti_1 := Int
    ]

5. Equality Int t2
   => t2 := Int

- Constraints:
    [ Equality top0 Int
    ]
- Accumulated Substitution:
    [ t1 := Int -> Int
    , ti_1 := Int
    , t2 := Int
    ]

5. Equality top0 Int
   => top0 := Int

- Constraints:
    []
- Accumulated Substitution:
    [ t1 := Int -> Int
    , ti_1 := Int
    , t2 := Int
    , top0 := Int
    ]

Substitution

Now we apply the substitution elaborated program AST:

 +--- Int
 |
 |      +--- Int
 |      |
---   ----
one = id 1
      -- -
      |  |
      |  +--- Int
      |
      +------ Int -> Int

Generalization

Nothing to do in this case! We get:

one :: Int

Result

This concludes our type inference process, and this is the result:

      +---a
      |
      |    +---- a
      |    |
     ---  ---
id = \x -> x
---  -------
 |      |
 |      +----- a -> a
 |
 |
 +--- forall a. a -> a

 +--- Int
 |
 |      +--- Int
 |      |
---   ----
one = id 1
      -- -
      |  |
      |  +--- Int
      |
      +------ Int -> Int

Summary

This is the general overview of Giml's type inference engine. If you are interested in learning more about it, check out the source code and documentation under the Language.Giml.Types hierarchy.

In the next blog posts I will cover the type inference of extensible records and polymorphic variants.

If this helped you, there's something you want me to clarify, you think there's something I missed, there's something you want to know more about, or there's something I got totally wrong, do contact me on mastodon or via Email.