Ori Compiler Design

What This Document Covers

This is the design documentation for the Ori compiler — a multi-crate Rust codebase that compiles a statically-typed, expression-based language with Hindley-Milner type inference, automatic reference counting, and capability-based effects. The compiler features a dual backend: a tree-walking interpreter for rapid development and a full LLVM pipeline for native binaries and WebAssembly.

These documents serve the same purpose as a compiler textbook’s case study chapters: they explain not just what each component does, but why it was designed that way, what alternatives were considered, and how the pieces fit together. Each section opens with the general compiler concept — what problem it solves, what the classical approaches are — then shows how Ori applies and adapts those ideas.

The Anatomy of a Compiler

A compiler is a program that translates source code from one language to another — typically from a high-level language humans write to a low-level form machines execute. Every compiler, regardless of language, must solve the same fundamental problems:

  1. Lexical analysis — Breaking source text into tokens (words, operators, punctuation)
  2. Parsing — Organizing tokens into a tree structure that reflects the program’s grammar
  3. Semantic analysis — Checking that the program makes sense (types match, variables are defined, etc.)
  4. Optimization — Transforming the program to run faster or use less memory without changing its meaning
  5. Code generation — Producing the target output (machine code, bytecode, another language)

What makes each compiler interesting is the specific tradeoffs it makes at each stage. A compiler for a dynamically-typed language might defer most semantic analysis to runtime. A JIT compiler might skip optimization in favor of startup speed. A research compiler might explore novel type systems at the cost of compilation time.

Ori’s compiler makes several distinctive choices across these stages that are worth understanding before diving into the details.

What Makes Ori’s Compiler Distinctive

Dual Backend with Shared Canonical IR

Most compilers have one backend. Ori has two: a tree-walking interpreter for ori run (instant feedback during development) and a full LLVM pipeline for ori build (native performance for production). Both consume the same canonical IR — a sugar-free, type-annotated intermediate representation produced by a single canonicalization pass. This means desugaring, pattern compilation, and constant folding happen exactly once, regardless of which backend executes the result.

Incremental Everything via Salsa

The compiler is built on the Salsa framework, which provides automatic incremental computation. Every phase — lexing, parsing, type checking, evaluation — is a Salsa query whose result is memoized. When a source file changes, only the affected queries re-execute. This matters for IDE integration: editing one function doesn’t re-type-check the entire module.

ARC Memory Management (Not GC, Not Borrow Checking)

Ori uses automatic reference counting with compile-time optimizations inspired by Perceus (Reinking et al., 2021) and Lean 4. The ARC system performs borrow inference, liveness analysis, RC insertion/elimination, and reset/reuse — all as compiler passes over a dedicated ARC IR. This sits between the simplicity of garbage collection (no borrow checker for users to fight) and the determinism of manual memory management (no GC pauses).

Expression-Based with No Return

Ori is expression-based in the ML/Rust tradition: every construct produces a value, and the last expression in a block is that block’s value. There is no return keyword — the language deliberately omits it; the lexer recognizes return only so it can emit a helpful error for users coming from other languages.

| Construct | Value |
| --- | --- |
| Function body | Last expression is the return value |
| if...then...else | Each branch is an expression |
| match arms | Each arm is an expression |
| { ... } block | Last expression (without ;) is the value |
| for...yield | Collected values form a list |

Early exit is handled through ? (propagate errors), break (exit loops), and panic (diverge with Never type).

Lean Core, Rich Libraries

The compiler implements only constructs that require special syntax or static analysis. Everything else — data transformation, string utilities, collection operations — belongs in the standard library as regular method calls.

| In Compiler | In Stdlib |
| --- | --- |
| { } blocks, try { }, match (bindings, early return) | map, filter, fold, find (collection methods) |
| recurse (self-referential self()) | retry, validate (library functions) |
| parallel, spawn, timeout (concurrency) | |
| cache, with (capability-aware resources) | |

The test is simple: “Does this need special syntax or static analysis?” If not, it’s a library function. This keeps the compiler focused and allows the stdlib to evolve without compiler changes.

Capability-Based Effects

Side effects are tracked through a capability system. Functions declare what they need (uses Http, FileSystem), and callers must provide those capabilities (with Http = MockHttp in expr). This enables compile-time effect tracking and trivial mocking in tests — no dependency injection frameworks required.

Compilation Pipeline

```mermaid
flowchart TB
    A["Source File
    Salsa input"]
    B["Token List
    tokens() query"]
    C["Parse Result
    Module + ExprArena"]
    D["Typed Module
    expr_types + errors"]
    E["Canonical IR
    CanArena + DecisionTrees"]
    F["Eval Result
    Value + EvalOutput"]
    G["ARC IR
    borrow + liveness + RC"]
    H["LLVM IR
    Native Binary"]

    A --> B --> C --> D --> E
    E --> F
    E --> G --> H

    classDef frontend fill:#1e3a5f,stroke:#60a5fa,color:#dbeafe
    classDef canon fill:#3b1f6e,stroke:#a78bfa,color:#e9d5ff
    classDef interpreter fill:#1a4731,stroke:#34d399,color:#d1fae5
    classDef native fill:#5c3a1e,stroke:#f59e0b,color:#fef3c7

    class A,B,C,D frontend
    class E canon
    class F interpreter
    class G,H native
```

Each step is a Salsa query with automatic memoization. After canonicalization, the pipeline forks: the interpreter consumes canonical IR directly, while the ARC system lowers it to a basic-block SSA IR with explicit reference counting before LLVM codegen.

Compiler Architecture

The compiler is organized as a multi-crate Rust workspace. Dependencies flow strictly downward — later phases depend on earlier ones, never the reverse.

| Crate | Purpose |
| --- | --- |
| ori_ir | Core IR types (AST, arena, interning, derives) — no dependencies |
| ori_diagnostic | Error reporting, DiagnosticQueue, suggestions, emitters |
| ori_lexer_core | Raw scanner, source buffer, token tags |
| ori_lexer | Tokenization (cooking layer over core) |
| ori_parse | Recursive descent Pratt parser |
| ori_types | Type system: Pool, InferEngine, unification, registries |
| ori_patterns | Pattern system, Value types, EvalError |
| ori_canon | Canonical IR lowering (desugaring, pattern compilation, constant folding) |
| ori_eval | Tree-walking interpreter components |
| ori_arc | ARC analysis (borrow inference, RC insertion/elimination, reset/reuse) |
| ori_llvm | LLVM backend for JIT and AOT compilation |
| ori_rt | Runtime library for AOT binaries (C ABI, zero compiler deps) |
| ori_fmt | Source code formatter (5-layer architecture) |
| oric | CLI orchestrator, Salsa queries, reporting |

Documentation Sections

Pipeline Stages (in compilation order)

Architecture (Section 01)

Intermediate Representation (Section 02)

Lexer (Section 03)

Parser (Section 04)

Type System (Section 05)

Pattern System (Section 06)

Canonicalization (Section 07)

Evaluator (Section 08)

ARC System (Section 09)

LLVM Backend (Section 10)

Subsystem Sections

Runtime (Section 11)

Formatter (Section 12)

Cross-Cutting Sections

Diagnostics (Section 13)

Testing (Section 14)

Platform Targets (Section 15)

Appendices