Type Representation

What Is Type Representation?

Every type checker needs a way to represent types — the internal data structure that answers questions like “is this an integer?”, “does this function take two arguments?”, “are these two types the same?”. The choice of representation determines how fast these questions can be answered, how much memory the type checker uses, and how well it integrates with caching and incremental compilation.

This choice matters more than it might seem. A type checker performs millions of type comparisons during inference: every time a function is called, every time an operator is applied, every time a pattern is matched, the checker compares types. The representation’s equality check is in the innermost loop of the most expensive compilation phase.

Compilers have converged on two broad approaches, each with clear tradeoffs.

Recursive Enum Trees

The most direct representation is a recursive enum, where each type variant owns its children:

enum Type {
    Int,
    Str,
    List(Box<Type>),
    Map(Box<Type>, Box<Type>),
    Function(Vec<Type>, Box<Type>),
    Struct { name: String, fields: Vec<(String, Type)> },
    Var(u32),  // Unresolved type variable
}

This is simple to build, simple to pattern-match, and directly mirrors the source-level type syntax. Most educational compilers and many production compilers start here. The problems emerge at scale: comparing two Function values for equality requires recursively comparing every parameter type and the return type. Hashing a type for use in a hash map requires traversing the entire structure. And identical types — say, [int] — may exist as thousands of separate heap allocations.
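The cost is easy to see with a derived equality over a trimmed version of the enum above (a sketch, not the real definition): comparing two function types walks every child.

```rust
// Trimmed sketch of the recursive enum, with derived structural equality.
#[derive(PartialEq)]
enum Type {
    Int,
    Str,
    List(Box<Type>),
    Function(Vec<Type>, Box<Type>),
}

// Derived PartialEq recursively compares every parameter and the return
// type: O(size of the type), not O(1).
fn same_type(a: &Type, b: &Type) -> bool {
    a == b
}
```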

Interned Pool Representations

The alternative is interning: store each unique type once in a pool, represent it everywhere as a small integer index, and use index equality as a proxy for structural equality.

struct TypeIdx(u32);

fn same_type(a: TypeIdx, b: TypeIdx) -> bool {
    a.0 == b.0  // O(1), not structural traversal
}

Two types with the same structure always receive the same index, so comparing them is a single integer comparison. The pool deduplicates automatically — constructing list(int) twice returns the same index both times.

The cost is indirection: inspecting a type requires a pool lookup (pool[idx]), and constructing a type requires hashing its structure to check for duplicates. But for type checkers, where comparison vastly outnumbers construction, this tradeoff is overwhelmingly favorable.
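A minimal interner makes the tradeoff concrete (names and structure hypothetical, not Ori's actual pool): construction pays a hash lookup, but identical structures always come back as the same index.

```rust
use std::collections::HashMap;

// Hypothetical minimal interner: each unique structure is stored once
// and handed out as a u32 index. Children are indices, not boxed values.
#[derive(Clone, PartialEq, Eq, Hash)]
enum Ty {
    Int,
    List(u32), // child is an index into the pool
}

#[derive(Default)]
struct Interner {
    items: Vec<Ty>,
    map: HashMap<Ty, u32>,
}

impl Interner {
    fn intern(&mut self, ty: Ty) -> u32 {
        // Construction pays a hash lookup to deduplicate...
        if let Some(&idx) = self.map.get(&ty) {
            return idx;
        }
        let idx = self.items.len() as u32;
        self.items.push(ty.clone());
        self.map.insert(ty, idx);
        idx
    }
}
```

...after which every comparison is a single integer equality on the returned indices.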

Ori’s Type Representation

Ori uses a pool-based design with a distinctive tag-driven architecture. Types are represented at two levels: TypeId in the parser (a lightweight index for parsed type annotations) and Idx in the type checker (the full interned type handle). Both share the same primitive indices for seamless bridging.

TypeId — Parser-Level Type Handle

The parser produces TypeId(u32) values for type annotations in source code. TypeId knows about 12 primitive types and two special markers, but nothing about compound types — those are the type checker’s responsibility:

Index  Constant   Type
-----  ---------  ------------------------------------------------
0      INT        64-bit signed integer
1      FLOAT      IEEE 754 double
2      BOOL       Boolean
3      STR        UTF-8 string
4      CHAR       Unicode scalar value
5      BYTE       8-bit unsigned
6      UNIT       Unit type ()
7      NEVER      Bottom type
8      ERROR      Poison type (error recovery)
9      DURATION   Time duration
10     SIZE       Memory size
11     ORDERING   Comparison result
12     INFER      Inference placeholder (not stored in pool)
13     SELF_TYPE  Self in trait/impl contexts (not stored in pool)

The INFER and SELF_TYPE markers are parser-level constructs that never enter the type pool. The bridge function resolve_type_id() maps TypeId → Idx — an identity mapping for primitives 0-11, followed by pool lookup for compound types starting at index 64 (FIRST_DYNAMIC).
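The primitive fast path of the bridge can be sketched as follows (a simplification: the INFER/SELF_TYPE handling and the compound-type pool lookup are elided):

```rust
// Sketch of resolve_type_id's primitive fast path: indices 0-11 map
// identically from TypeId to Idx. INFER (12), SELF_TYPE (13), and
// compound types need the pool and are elided here.
fn resolve_primitive(type_id: u32) -> Option<u32> {
    if type_id <= 11 { Some(type_id) } else { None }
}
```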

Idx — The Universal Type Handle

Idx(u32) is the type handle used throughout the type checker, evaluator, and code generator. It shares the same primitive layout as TypeId (indices 0-11 are identical), and adds a sentinel value and a dynamic range:

pub struct Idx(u32);

impl Idx {
    pub const INT: Self = Self(0);
    pub const FLOAT: Self = Self(1);
    // ... through ORDERING = 11
    pub const FIRST_DYNAMIC: u32 = 64;
    pub const NONE: Self = Self(u32::MAX);  // Sentinel
}

Comparing a type to int is idx == Idx::INT — a single integer comparison, no pool access needed. This matters because primitive type checks are among the most frequent operations in inference.

Tag — Type Kind Discriminant

Each type in the pool has a Tag(u8) that identifies its kind. Tags are organized into ranges, where the range determines how the tag’s companion data field is interpreted:

flowchart LR
    subgraph "Tag Ranges (u8)"
        P["0-15
        Primitives"]
        SC["16-31
        Simple Containers"]
        TC["32-47
        Two-Child"]
        CX["48-79
        Complex"]
        NM["80-95
        Named"]
        TV["96-111
        Variables"]
        TS["112-127
        Schemes"]
        SP["240-255
        Special"]
    end

    classDef frontend fill:#1e3a5f,stroke:#60a5fa,color:#dbeafe
    classDef canon fill:#3b1f6e,stroke:#a78bfa,color:#e9d5ff
    classDef native fill:#5c3a1e,stroke:#f59e0b,color:#fef3c7
    classDef interpreter fill:#1a4731,stroke:#34d399,color:#d1fae5

    class P frontend
    class SC,TC canon
    class CX,NM native
    class TV,TS,SP interpreter

The range-based organization enables fast category checks: tag < 16 means primitive, 16 <= tag < 48 means container, 96 <= tag < 112 means type variable. These are single integer comparisons — no match expression, no hash lookup.
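Those category checks can be sketched as range predicates on the tag byte (method names hypothetical):

```rust
// Sketch of range-based category checks on a Tag(u8). Each predicate is
// one or two integer comparisons; no match expression, no hash lookup.
#[derive(Clone, Copy)]
struct Tag(u8);

impl Tag {
    fn is_primitive(self) -> bool { self.0 < 16 }
    fn is_container(self) -> bool { (16..48).contains(&self.0) }
    fn is_type_var(self) -> bool { (96..112).contains(&self.0) }
}
```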

Primitives (0-15) — data is unused (always 0): Int=0 Float=1 Bool=2 Str=3 Char=4 Byte=5 Unit=6 Never=7 Error=8 Duration=9 Size=10 Ordering=11

Simple containers (16-31) — data is the child Idx: List=16 Option=17 Set=18 Channel=19 Range=20 Iterator=21 DoubleEndedIterator=22

Two-child containers (32-47) — data is an index into the extra array (2 consecutive entries): Map=32 Result=33 Borrowed=34

Complex types (48-79) — data is an index into extra with a length prefix: Function=48 Tuple=49 Struct=50 Enum=51

Named types (80-95) — data is an index into extra: Named=80 Applied=81 Alias=82

Type variables (96-111) — data is a variable ID indexing var_states: Var=96 BoundVar=97 RigidVar=98

Type schemes (112-127) — data is an index into extra: Scheme=112

Special (240-255) — compiler-internal markers: Projection=240 ModuleNs=241 Infer=254 SelfType=255

Item — The 5-Byte Type Entry

Each type in the pool is stored as an Item — a compact pair of tag and data:

#[repr(C)]
pub struct Item {
    pub tag: Tag,   // 1 byte — which kind of type
    pub data: u32,  // 4 bytes — interpretation depends on tag
}

Logically 5 bytes, padded to 8 by Rust’s alignment rules. The data field’s meaning is entirely determined by the tag’s range — a design that trades type safety for compactness. A simple container’s data is a child Idx. A complex type’s data is an offset into the extra array. A type variable’s data is a variable ID.

This encoding means the pool’s items: Vec<Item> is a contiguous array of 8-byte entries — compact enough that the entire type pool for a typical file fits in L2 cache.
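The size and alignment claims can be checked directly (the struct is repeated here for a self-contained check, with Tag flattened to its underlying u8):

```rust
use std::mem::{align_of, size_of};

// Item as above; Tag is a u8 newtype, flattened here for the check.
// repr(C) lays out tag (1 byte), 3 bytes of padding, data (4 bytes).
#[repr(C)]
pub struct Item {
    pub tag: u8,
    pub data: u32,
}
```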

Pool — Unified Type Storage

The Pool is the central data structure of the type checker. All types, variables, and metadata live here:

pub struct Pool {
    items: Vec<Item>,                    // Type entries (parallel arrays below)
    flags: Vec<TypeFlags>,               // Pre-computed metadata
    hashes: Vec<u64>,                    // Stable Merkle hashes
    extra: Vec<u32>,                     // Variable-length data
    intern_map: FxHashMap<u64, Idx>,     // Deduplication by hash
    resolutions: FxHashMap<Idx, Idx>,    // Named type → concrete type
    var_states: Vec<VarState>,           // Type variable state
    next_var_id: u32,                    // Fresh variable counter
}

The pool uses Merkle hashing for interning: each type’s hash is computed from its tag, data, and the hashes of its children (recursively). This means list(int) always produces the same hash regardless of when it’s constructed, and the intern_map can detect duplicates in O(1) amortized time.
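The scheme can be sketched with the standard library hasher (the hash function here is a stand-in; the real pool needs a hash that is stable across runs and pools):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Merkle hash sketch: a type's hash combines its tag, its data, and the
// already-computed hashes of its children, so equal structures always
// hash equally regardless of construction order.
fn merkle_hash(tag: u8, data: u32, child_hashes: &[u64]) -> u64 {
    let mut h = DefaultHasher::new();
    tag.hash(&mut h);
    data.hash(&mut h);
    for &child in child_hashes {
        child.hash(&mut h);
    }
    h.finish()
}
```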

Initialization: Pool::new() pre-interns 12 primitives at fixed indices 0-11 and pads to FIRST_DYNAMIC (64). This guarantees that Idx::INT, Idx::STR, etc. are valid indices without any construction.

Extra Array — Variable-Length Data

Complex types store their variable-length data (parameter lists, field lists, variant lists) in the shared extra: Vec<u32> array. The data field in Item points to the start position:

Type        Extra Layout
----------  ----------------------------------------------------------------
Function    [param_count, p0, p1, ..., return_type]
Tuple       [elem_count, e0, e1, ...]
Struct      [name_lo, name_hi, field_count, f0_name, f0_type, ...]
Enum        [name_lo, name_hi, variant_count, v0_name, v0_field_count, ...]
Named       [name_lo, name_hi]
Applied     [name_lo, name_hi, arg_count, arg0, arg1, ...]
Scheme      [var_count, var0, var1, ..., body_type]
Map/Result  [child0, child1]

This shared-array design avoids per-type heap allocations for parameter lists and field lists. A function type with 3 parameters occupies 5 entries in extra ([3, p0, p1, p2, ret]) — no Vec, no Box, no separate allocation.
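Decoding a function type from the shared array is then plain index arithmetic (a sketch; function names hypothetical):

```rust
// Function layout in `extra`: [param_count, p0, p1, ..., return_type].
// `data` is the start offset of this slice.
fn function_params(extra: &[u32], data: u32) -> &[u32] {
    let start = data as usize;
    let count = extra[start] as usize;
    &extra[start + 1..start + 1 + count]
}

fn function_return(extra: &[u32], data: u32) -> u32 {
    let start = data as usize;
    let count = extra[start] as usize;
    extra[start + 1 + count]
}
```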

Type Variables and Unification

Type variables (tags Var, BoundVar, RigidVar) use the var_states array to track their state during inference:

pub enum VarState {
    Unbound { id: u32, rank: Rank, name: Option<Name> },
    Link { target: Idx },
    Rigid { name: Name },
    Generalized { id: u32, name: Option<Name> },
}

Unification links variables by setting VarState::Link { target }. The resolve() method follows link chains to the final target, with path compression to amortize future lookups — the classic union-find optimization.
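The resolution step can be sketched over a trimmed VarState (only the two states that matter for link-following; the real method works on Idx values rather than bare slot numbers):

```rust
// Union-find style resolution over var_states (sketch).
#[derive(Clone)]
enum VarState {
    Unbound { id: u32 },
    Link { target: u32 }, // target names another variable slot here
}

fn resolve(var_states: &mut Vec<VarState>, var: u32) -> u32 {
    match var_states[var as usize].clone() {
        VarState::Link { target } => {
            let root = resolve(var_states, target);
            // Path compression: point directly at the root for next time.
            var_states[var as usize] = VarState::Link { target: root };
            root
        }
        VarState::Unbound { .. } => var,
    }
}
```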

TypeFlags — O(1) Property Queries

TypeFlags is a 32-bit bitfield computed once when a type is interned. Flags propagate from children to parents via bitwise OR during construction:

Bits 0-7:   Presence    — HAS_VAR, HAS_ERROR, HAS_INFER, HAS_SELF, ...
Bits 8-15:  Category    — IS_PRIMITIVE, IS_CONTAINER, IS_FUNCTION, ...
Bits 16-23: Optimization — NEEDS_SUBST, IS_RESOLVED, IS_MONO, IS_COPYABLE
Bits 24-31: Capability  — HAS_CAPABILITY, IS_PURE, HAS_IO, HAS_ASYNC

The propagation rule is: when constructing list(T), the list’s flags include T’s flags (via OR) plus the list’s own category flags. This means checking whether a deeply nested type like [[Option<Result<T, E>>]] contains any unresolved type variables is a single bitwise AND: flags.contains(HAS_VAR). No traversal needed, regardless of nesting depth.
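The propagation rule reduces to a bitwise OR per construction (a sketch with two stand-in flag bits; the actual flag values and names are as listed above):

```rust
// Bitflags propagation sketch: a parent's flags OR together its
// children's flags plus the parent's own category bits.
const HAS_VAR: u32 = 1 << 0;       // presence bit (stand-in value)
const IS_CONTAINER: u32 = 1 << 8;  // category bit (stand-in value)

fn container_flags(child_flags: u32) -> u32 {
    child_flags | IS_CONTAINER
}
```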

This pattern comes directly from rustc’s TypeFlags, which uses the same propagation strategy for the same reason.

Type Construction and Queries

The pool provides typed builder methods for construction and O(1) accessors for queries:

impl Pool {
    // Construction — all return interned, deduplicated Idx
    pub fn list(&mut self, elem: Idx) -> Idx { ... }
    pub fn map(&mut self, key: Idx, value: Idx) -> Idx { ... }
    pub fn function(&mut self, params: &[Idx], ret: Idx) -> Idx { ... }
    pub fn struct_type(&mut self, name: Name, fields: &[(Name, Idx)]) -> Idx { ... }
    pub fn fresh_var(&mut self) -> Idx { ... }

    // Queries — O(1) via direct indexing
    pub fn tag(&self, idx: Idx) -> Tag { ... }
    pub fn flags(&self, idx: Idx) -> TypeFlags { ... }
    pub fn list_elem(&self, idx: Idx) -> Idx { ... }
    pub fn function_params(&self, idx: Idx) -> &[u32] { ... }
    pub fn resolve(&self, idx: Idx) -> Idx { ... }
}

Type display for error messages (pool.format(idx)) resolves Idx values to human-readable strings like [int], {str: int}, (int, int) -> bool, using the StringLookup trait to resolve interned Name values.

Design Tradeoffs

Pool-per-File

Each source file gets its own Pool. Types from different files are bridged via Merkle hash lookup — when importing a function from module B, the type checker computes the Merkle hash of the function’s type in module B’s pool and looks it up in module A’s pool. If the hash matches, the existing Idx is reused; otherwise, the type is re-interned into module A’s pool.

This per-file isolation simplifies incremental compilation (changing file A doesn’t invalidate file B’s pool) but means that the same type can have different Idx values in different pools. Cross-pool comparisons must always go through the hash-based bridge, not raw Idx equality.
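Under this scheme, the happy path of the bridge is a single hash-map lookup (a sketch; the re-interning path on a miss is elided):

```rust
use std::collections::HashMap;

// Cross-pool bridge sketch: importing a type from pool B into pool A
// looks up B's Merkle hash in A's intern map. On a hit the existing Idx
// is reused; on a miss the type would be re-interned (elided here).
fn bridge(hash_from_b: u64, intern_map_a: &HashMap<u64, u32>) -> Option<u32> {
    intern_map_a.get(&hash_from_b).copied()
}
```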

Tag Safety vs Compactness

The Item { tag, data } encoding packs maximum information into 5 bytes, but the data field’s meaning is determined entirely by the tag — a data value of 42 could be a child Idx, an extra offset, or a variable ID depending on the tag. Misinterpreting the data field is a logic bug that the type system can’t catch.

In practice, this is safe because all access goes through typed accessor methods (list_elem(), function_params(), var_state()) that check the tag before interpreting the data. The compactness benefit — the entire type pool fitting in cache — justifies the loss of static type safety on the internal representation.

No Shrinking

Like the expression arena, the type pool only grows. Type entries are never deleted. This is acceptable because pools are per-file and per-session — the pool is dropped when the file’s compilation results are invalidated, reclaiming all memory at once.

Prior Art

Zig — InternPool

Zig’s InternPool is the most direct influence on Ori’s Pool design. Zig unifies type interning, string interning, and value interning into a single flat storage with tag-driven dispatch and an extra array for variable-length data. The structural similarity is deliberate: both systems use a (tag, data) pair as the core entry format, both use a shared extra array for complex types, and both use hash-based deduplication. Zig’s design demonstrated that this architecture scales to a production compiler handling millions of types.

rustc — TyKind and TypeFlags

Rust’s compiler represents types via TyKind, an enum with variants for every type kind, interned through a thread-local context. rustc’s TypeFlags — pre-computed bitflags propagated via OR during construction — directly inspired Ori’s TypeFlags. The flag categories (presence, structural, optimization) and the propagation strategy are essentially identical.

GHC — Unique-Based Interning

GHC (the Glasgow Haskell Compiler) uses a Unique identifier for type interning. Each type constructor and type variable gets a globally unique integer, enabling O(1) comparison. GHC’s approach is more fine-grained than Ori’s pool-based interning — individual type constructors are unique, but compound types may still require structural comparison. Ori’s pool-level interning goes further by making even compound types like (int, int) -> bool comparable by index.

Lean 4 — IRType for ARC Classification

Lean 4’s IRType classifies types for ARC analysis into categories similar to Ori’s ArcClass (scalar vs. reference). Lean’s type classification is simpler than Ori’s Tag ranges — it primarily distinguishes between types that need reference counting and types that don’t — but the underlying insight is the same: compact type metadata enables fast dispatch in the code generator without full type analysis.