Runtime Overview
What Is a Language Runtime?
Every compiled language needs a bridge between the machine code it produces and the services that code relies on at execution time. This bridge is the runtime library — a collection of functions, data structures, and conventions that the compiler’s generated code calls into for operations it cannot (or should not) perform inline. The compiler emits call ori_rc_dec(ptr), and somewhere, a real function must exist to handle that call. The runtime is where that function lives.
The scope of a runtime library varies enormously across languages. C’s runtime is minimal: malloc, free, printf, and a handful of startup/shutdown functions. The compiler generates self-contained machine code for arithmetic, control flow, and memory access; the runtime provides only what the hardware cannot. At the other extreme, Java’s runtime is a full virtual machine — a bytecode interpreter, garbage collector, class loader, JIT compiler, thread scheduler, and standard library, all bundled together. The “compiled” program is bytecode that cannot execute without the runtime present.
Most production languages fall somewhere between these poles. Go’s runtime includes a garbage collector and goroutine scheduler but no bytecode interpreter. Rust’s runtime is minimal (just the allocator and panic infrastructure) because the borrow checker eliminates the need for runtime memory management. Swift’s runtime includes reference counting operations, type metadata for dynamic dispatch, and protocol witness tables — similar in scope to what Ori needs.
The ARC Runtime Pattern
Languages that use automatic reference counting (ARC) as their memory management strategy need a specific kind of runtime. The compiler statically determines where reference count operations must be inserted, but the operations themselves — allocating with a header, incrementing atomically, decrementing with cleanup — are too complex and too shared to inline everywhere. They live in the runtime.
Swift pioneered this pattern in a production setting. The Swift runtime provides swift_retain and swift_release (the equivalents of ori_rc_inc and ori_rc_dec), along with type metadata, protocol conformance tables, and heap object management. Lean 4 follows a similar pattern with its lean_inc_ref and lean_dec_ref functions. In both cases, the compiler’s static analysis determines where to call these functions, and the runtime provides what those functions do.
Ori’s ori_rt crate follows this pattern: the ARC analysis pass determines statically where reference count operations belong, and the runtime provides the atomic increment/decrement implementations, copy-on-write collection mutations, string operations, and I/O primitives that the generated code calls into.
What Makes Ori’s Runtime Distinctive
Zero Compiler Dependencies
The ori_rt crate has no dependencies on the compiler. It does not import ori_ir, ori_types, ori_parse, or any other compiler crate. It links only against the Rust standard library and the system allocator. This is not an accident — it is a hard architectural constraint that keeps the runtime minimal and ensures that changes to the compiler’s internal representations never ripple into the runtime.
The contract between compiler and runtime is entirely defined by C-ABI function signatures. The LLVM backend emits call @ori_rc_dec(ptr, drop_fn), and the runtime provides a function with that exact name and calling convention. Neither side knows about the other’s internal types.
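The shape of that contract can be illustrated with a toy entry point. Everything here is a hypothetical sketch (`demo_rc_dec`, `DropFn`, and `mark_dropped` are invented names); only the pattern — a `#[no_mangle] extern "C"` function taking a pointer and a drop callback — mirrors the description above:

```rust
use std::ffi::c_void;
use std::sync::atomic::{AtomicBool, Ordering};

/// Drop callback supplied by generated code, so the runtime can run
/// type-specific cleanup without knowing the element type.
type DropFn = unsafe extern "C" fn(*mut c_void);

/// Toy decrement entry point. The symbol name and C calling convention
/// are the entire contract: generated code emits a call to this name.
#[no_mangle]
pub unsafe extern "C" fn demo_rc_dec(count: *mut i64, data: *mut c_void, drop_fn: DropFn) {
    unsafe {
        *count -= 1;
        if *count == 0 {
            drop_fn(data); // last reference: run the caller-supplied cleanup
        }
    }
}

static DROPPED: AtomicBool = AtomicBool::new(false);

/// A stand-in for a compiler-generated drop function.
unsafe extern "C" fn mark_dropped(_data: *mut c_void) {
    DROPPED.store(true, Ordering::SeqCst);
}
```

Neither side needs the other's types: the caller only needs the symbol name and signature, and the runtime only sees raw pointers.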
Dual Build Artifacts
The crate builds as both an rlib (Rust library) and a staticlib (C-compatible archive):
- `libori_rt.rlib` — Used by `ori_llvm` for JIT execution. The LLVM execution engine resolves runtime function addresses directly from the loaded Rust library, enabling `ori run` to call runtime functions without a separate linking step.
- `libori_rt.a` — Linked into AOT-compiled binaries by the system linker. When `ori build` produces a native executable, the linker resolves all `ori_*` symbols against this static archive.
Both artifacts are built by cargo b (debug) or cargo b --release (release). This dual-output design means the same runtime code serves both the development workflow (JIT) and the production workflow (AOT), eliminating the class of bugs where the JIT runtime behaves differently from the AOT runtime.
Data Pointer Convention
RC allocations return a data pointer — a pointer to the user data region, past the 16-byte header — rather than a pointer to the allocation base. This seemingly small decision has deep consequences:
- Generated code passes data pointers directly to C FFI without adjustment
- Every RC operation recovers the header by subtracting a fixed offset (`ptr - 16` for the count, `ptr - 8` for the size)
- The data pointer is the value — no wrapping, no indirection, no fat pointer needed
This matches Swift’s approach, where HeapObject* points to the object data (past the metadata/refcount header), and contrasts with CPython, where PyObject* points to the header and callers must offset to reach the data.
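A minimal model of the convention, assuming the 16-byte header layout described above (count at `data - 16`, size at `data - 8`); the function names are illustrative, not ori_rt's actual API:

```rust
use std::alloc::{alloc, dealloc, Layout};

const HEADER: usize = 16; // 8-byte count + 8-byte size, ahead of the data

/// Allocate `size` bytes of user data behind a 16-byte header and return
/// a pointer to the *data*, not the allocation base.
unsafe fn rc_alloc_demo(size: usize) -> *mut u8 {
    unsafe {
        let layout = Layout::from_size_align(HEADER + size, 8).unwrap();
        let base = alloc(layout);
        assert!(!base.is_null());
        (base as *mut i64).write(1);                  // count at data - 16
        (base as *mut i64).add(1).write(size as i64); // size at data - 8
        base.add(HEADER)                              // the data pointer is the value
    }
}

/// Every RC operation recovers header fields by a fixed negative offset.
unsafe fn rc_count(data: *const u8) -> i64 {
    unsafe { (data.sub(16) as *const i64).read() }
}

unsafe fn rc_size(data: *const u8) -> i64 {
    unsafe { (data.sub(8) as *const i64).read() }
}

unsafe fn rc_free_demo(data: *mut u8) {
    unsafe {
        let size = rc_size(data) as usize;
        let layout = Layout::from_size_align(HEADER + size, 8).unwrap();
        dealloc(data.sub(HEADER), layout); // free from the allocation base
    }
}
```

Because the returned pointer already points at the user data, it can be handed to C FFI unchanged.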
Null Sentinels for Empty Collections
Empty lists, maps, and sets use a null data pointer with zero length and zero capacity. No allocation occurs until the first element is added. The runtime makes ori_rc_inc(null) and ori_rc_dec(null) explicit no-ops, so empty collections flow through the entire RC protocol without special-casing at every call site.
This means creating an empty list is free (24 bytes of zeros on the stack), passing it around is free (no RC operations on null), and dropping it is free (the no-op dec). The first push triggers the initial allocation with MIN_COLLECTION_CAPACITY = 4 elements.
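Under these conventions an empty list really is just 24 bytes of zeros. A small sketch (`DemoList` and `rc_op_needed` are illustrative names, not the runtime's types):

```rust
/// Illustrative 24-byte list header: {len, cap, data}.
#[repr(C)]
#[derive(Clone, Copy)]
struct DemoList {
    len: i64,
    cap: i64,
    data: *mut u8, // null until the first push triggers an allocation
}

impl DemoList {
    /// An empty list is 24 bytes of zeros: no allocation, no RC state.
    fn empty() -> Self {
        DemoList { len: 0, cap: 0, data: std::ptr::null_mut() }
    }
}

/// inc/dec are explicit no-ops on null, so empty collections flow
/// through the full RC protocol without any special-casing at call sites.
fn rc_op_needed(ptr: *const u8) -> bool {
    !ptr.is_null()
}
```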
Consuming COW Semantics
Every mutating collection operation takes ownership of the caller’s reference to the data buffer. The caller passes its reference in and receives a new {len, cap, data} triple through an sret output pointer. After the call, the caller must not access the original buffer.
This consuming protocol enables the fast path: when the reference count is 1, the runtime mutates the buffer in place and returns the same pointer. No copy, no RC changes — the sole reference transfers from input to output. On the slow path (shared buffer), the runtime copies, increments element RCs on the copy, and decrements the old buffer’s RC. The consuming protocol makes the fast path a zero-cost operation rather than a copy-then-dec.
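The two paths can be modeled with Rust's `Rc`, which exposes the same unique-owner check; `push_cow` here is a hypothetical stand-in for `ori_list_push_cow`, operating on a `Vec` instead of raw memory:

```rust
use std::rc::Rc;

/// Consuming push: takes ownership of the caller's reference and returns
/// the reference the caller holds afterwards.
fn push_cow(mut list: Rc<Vec<i64>>, value: i64) -> Rc<Vec<i64>> {
    if Rc::strong_count(&list) == 1 {
        // Fast path: sole owner. Mutate in place; no copy, no RC traffic.
        Rc::get_mut(&mut list).unwrap().push(value);
        list
    } else {
        // Slow path: shared buffer. Copy, push on the copy; the input
        // reference is consumed (dropped) when `list` goes out of scope.
        let mut copy = (*list).clone();
        copy.push(value);
        Rc::new(copy)
    }
}
```

If the protocol were borrowing instead of consuming, the fast path would need an extra increment to hand a reference back, which is exactly the atomic operation the consuming design avoids.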
SSO Strings as First-Class Values
Strings of 23 bytes or fewer are stored entirely inline in the 24-byte OriStr struct — no heap allocation, no reference counting, no cleanup. An SSO string has the same copy cost as a 24-byte memcpy and zero drop cost. This makes short strings (identifiers, error codes, format fragments) as cheap as primitive values in terms of memory management overhead.
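The inline layout can be sketched in a 24-byte struct. This is illustrative only: `SsoDemo` is an invented type, and OriStr's actual discriminant scheme (how it tags heap vs inline mode) lives in ori_rt.

```rust
/// 24-byte value: 23 inline bytes plus one length byte.
#[repr(C)]
struct SsoDemo {
    buf: [u8; 23], // string bytes stored directly in the value
    len: u8,       // inline length; a real impl also tags heap vs inline here
}

impl SsoDemo {
    /// Store `s` inline if it fits in 23 bytes; a real runtime would
    /// otherwise fall back to a heap-allocated, reference-counted buffer.
    fn new_inline(s: &str) -> Option<SsoDemo> {
        if s.len() > 23 {
            return None;
        }
        let mut buf = [0u8; 23];
        buf[..s.len()].copy_from_slice(s.as_bytes());
        Some(SsoDemo { buf, len: s.len() as u8 })
    }

    fn as_str(&self) -> &str {
        std::str::from_utf8(&self.buf[..self.len as usize]).unwrap()
    }
}
```

Copying an `SsoDemo` is a 24-byte memcpy and dropping it is free, which is what makes short strings as cheap as primitives.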
Architecture
The runtime sits at the bottom of the compilation pipeline. The LLVM backend emits call instructions targeting ori_rt’s #[no_mangle] extern "C" functions. These calls are resolved at link time (AOT) or symbol resolution time (JIT).
```mermaid
flowchart TB
    Source["Source .ori"] --> Parse["Parse"]
    Parse --> TypeCheck["Type Check"]
    TypeCheck --> Canon["Canonicalize"]
    Canon --> ARC["ARC Analysis<br/>RC insertion"]
    ARC --> LLVM["LLVM Codegen<br/>call @ori_rc_dec<br/>call @ori_list_push_cow<br/>call @ori_str_concat"]
    LLVM --> Link["Link against<br/>libori_rt.a"]
    Link --> Binary["Native Binary"]
    LLVM --> JIT["JIT resolve against<br/>libori_rt.rlib"]
    JIT --> Exec["Direct Execution"]

    classDef frontend fill:#1e3a5f,stroke:#60a5fa,color:#dbeafe
    classDef canon fill:#3b1f6e,stroke:#a78bfa,color:#e9d5ff
    classDef native fill:#5c3a1e,stroke:#f59e0b,color:#fef3c7
    classDef interpreter fill:#1a4731,stroke:#34d399,color:#d1fae5
    class Source,Parse,TypeCheck frontend
    class Canon,ARC canon
    class LLVM,Link,JIT native
    class Binary,Exec interpreter
```
The runtime never calls back into the compiler. Data flows one way: compiled code calls runtime functions, the runtime operates on raw memory, and results are returned through C ABI conventions — return values for small results, sret output pointers for aggregates larger than 16 bytes, or in-place mutation for COW fast paths.
Module Organization
The runtime is organized into functional modules, each responsible for a category of operations:
```mermaid
flowchart TB
    RT["ori_rt"] --> RC["rc/<br/>Allocation, inc, dec<br/>Uniqueness, tracing<br/>Collection RC helpers"]
    RT --> List["list/<br/>COW mutations<br/>Seamless slices<br/>Sort, structural ops<br/>Reset/reuse"]
    RT --> Map["map/<br/>Split-buffer COW<br/>Key lookup<br/>Insert, remove, update"]
    RT --> Set["set/<br/>Contiguous COW<br/>Union, intersection<br/>Difference"]
    RT --> Str["string/<br/>SSO layout<br/>COW concat<br/>Methods, conversion"]
    RT --> Fmt["format/<br/>Template interpolation<br/>Spec parsing<br/>Type formatters"]
    RT --> Iter["iterator/<br/>Opaque handles<br/>Source + adapter variants<br/>Consumer operations"]
    RT --> IO["io.rs<br/>Print, panic<br/>Catch/recover<br/>Entry point wrapper"]
    RT --> Slice["slice_encoding/<br/>Negative-cap encoding<br/>Offset recovery"]

    classDef native fill:#5c3a1e,stroke:#f59e0b,color:#fef3c7
    class RT,RC,List,Map,Set,Str,Fmt,Iter,IO,Slice native
```
Function Categories
The runtime exports approximately 80 C-ABI functions. They fall into six categories:
| Category | Functions | Purpose |
|---|---|---|
| Memory | ori_alloc, ori_free, ori_realloc | Raw allocator wrappers |
| Reference Counting | ori_rc_alloc, ori_rc_inc, ori_rc_dec, ori_rc_is_unique, … | RC lifecycle (see Reference Counting) |
| Collection COW | ori_list_push_cow, ori_map_insert_cow, ori_set_union_cow, … | Copy-on-write mutations (see Collections & COW) |
| String Operations | ori_str_concat, ori_str_split, ori_str_eq, … | SSO-aware string handling (see String SSO) |
| Format | ori_format_int, ori_format_float, ori_format_str, … | Template string interpolation |
| I/O and Panic | ori_print, ori_panic, ori_run_main, ori_catch_recover, … | Output, error handling, entry point |
C ABI Design Decisions
All runtime functions use #[no_mangle] extern "C" for cross-language compatibility. Several design decisions shape the calling conventions:
sret output pattern. Functions returning collections write results through an out_ptr parameter rather than returning by value. OriList, OriMap, and OriStr are all 24 bytes — above the 16-byte threshold for register return on x86-64 System V ABI. Explicit sret gives the codegen control over the destination address, which is essential for correct integration with LLVM’s alloca/store/load pattern.
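A sketch of the out-pointer pattern (`DemoTriple` and `demo_list_empty` are hypothetical; the real functions write `OriList`/`OriMap`/`OriStr` results the same way):

```rust
/// Illustrative 24-byte aggregate: too large for register return on
/// x86-64 System V, so the callee writes it through an out pointer.
#[repr(C)]
pub struct DemoTriple {
    pub len: i64,
    pub cap: i64,
    pub data: *mut u8,
}

/// sret-style: the caller allocates the destination (typically an LLVM
/// alloca) and passes its address; the callee writes the result there.
#[no_mangle]
pub unsafe extern "C" fn demo_list_empty(out: *mut DemoTriple) {
    unsafe {
        out.write(DemoTriple { len: 0, cap: 0, data: std::ptr::null_mut() });
    }
}
```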
Function pointer callbacks. COW operations accept inc_fn (element RC increment), elem_dec_fn (element RC decrement), key_eq (key equality), and comparator callbacks as C function pointers. The LLVM backend generates type-specialized trampolines for each concrete type. This keeps the runtime entirely type-agnostic — it never needs to know what type the elements are, only how to increment, decrement, compare, or drop them.
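The division of labor can be sketched like this; `dec_elements` and `demo_dec_i64` are hypothetical stand-ins for a type-agnostic runtime walk and a compiler-emitted trampoline:

```rust
use std::ffi::c_void;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Element-decrement callback type (illustrative signature).
type ElemDecFn = unsafe extern "C" fn(*mut c_void);

/// The runtime knows only element count and stride; the callback is the
/// sole carrier of type knowledge.
unsafe fn dec_elements(base: *mut u8, count: usize, stride: usize, dec: ElemDecFn) {
    unsafe {
        for i in 0..count {
            dec(base.add(i * stride) as *mut c_void);
        }
    }
}

static DEC_CALLS: AtomicUsize = AtomicUsize::new(0);

/// What a compiler-generated trampoline might look like; this one only
/// counts invocations so the behavior is observable.
unsafe extern "C" fn demo_dec_i64(_elem: *mut c_void) {
    DEC_CALLS.fetch_add(1, Ordering::SeqCst);
}
```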
Consuming semantics. Every COW mutation function takes ownership of the caller’s reference to the data buffer. This is not just a convention — it is load-bearing for correctness. The fast path (unique owner) mutates in place and returns the same pointer without any RC changes. If the convention were borrowing (caller retains its reference), the fast path would need an extra increment to hand back the reference, and the common case would pay an atomic operation it does not need.
Build Modes
The crate supports one feature flag:
- `single-threaded` — Substitutes non-atomic `i64` reads/writes for `AtomicI64` operations. This eliminates atomic operation overhead in programs that do not use task parallelism. The flag is compile-time only — there is no runtime check.
Debugging and Diagnostics
The runtime provides three environment-variable-controlled diagnostic modes that compose freely:
ORI_TRACE_RC=1 logs every RC operation (alloc, inc, dec, free) to stderr with pointer addresses and count transitions. The verbose setting adds stack backtraces to each operation. The trace check uses OnceLock to read the environment variable once and cache the result — the cost when disabled is a single always-not-taken branch per RC operation.
```
[RC] alloc 0x7f8a1c000b70 size=48 count=1
[RC] inc 0x7f8a1c000b70 count=1->2
[RC] dec 0x7f8a1c000b70 count=2->1
[RC] dec 0x7f8a1c000b70 count=1->0 (dropping)
[RC] free 0x7f8a1c000b70 size=48
```
ORI_RT_DEBUG=1 enables runtime assertions that validate RC headers on every operation — catching use-after-free, double-free, and header corruption. Debug builds additionally track freed pointers in a HashSet for double-free detection.
ORI_CHECK_LEAKS=1 counts live RC allocations via a global atomic counter. At program exit, an atexit handler reports unfreed allocations. Exit code 2 indicates a detected leak. Debug builds track allocation sites (pointer, size, alignment) for attribution.
These modes compose: ORI_TRACE_RC=1 ORI_CHECK_LEAKS=1 ORI_RT_DEBUG=1 ./binary enables all three simultaneously. All are zero-cost when disabled — the first access caches the environment variable, and subsequent checks are a single branch on a cached boolean.
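The caching pattern behind that zero-cost claim can be sketched with `OnceLock` (`trace_enabled` is a hypothetical name; the runtime's actual flag handling may differ):

```rust
use std::sync::OnceLock;

/// Read the diagnostic env var once; every later call is a branch on a
/// cached boolean, so the disabled case costs one predictable branch.
fn trace_enabled() -> bool {
    static ENABLED: OnceLock<bool> = OnceLock::new();
    *ENABLED.get_or_init(|| {
        std::env::var("ORI_TRACE_RC").map(|v| v == "1").unwrap_or(false)
    })
}
```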
Prior Art
Swift’s runtime is the closest analog. Swift uses ARC with a similar two-word header (metadata pointer + refcount), swift_retain/swift_release for RC operations, and copy-on-write semantics for its standard library collections (Array, Dictionary, Set). The key differences: Swift’s runtime includes type metadata for dynamic dispatch and protocol witness tables — capabilities Ori does not need because it uses monomorphization. Swift’s refcount also packs additional bits (pinned flag, unowned count) into the refcount word, while Ori uses a simpler single-counter design.
Lean 4’s runtime implements reference counting for a functional language with similar goals. Lean’s lean_object header contains a refcount and a tag byte for type discrimination. Lean’s RC operations (lean_inc_ref, lean_dec_ref) follow the same Relaxed-increment / Release-decrement / Acquire-fence-before-drop synchronization protocol that Ori uses. Lean also implements reset/reuse optimization at the runtime level — detecting unique ownership and recycling allocations — which Ori handles at the ARC IR level instead.
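The shared synchronization protocol can be sketched with Rust atomics (function names are hypothetical; the real operations act on the count word behind the data pointer):

```rust
use std::sync::atomic::{fence, AtomicI64, Ordering};

fn rc_inc(count: &AtomicI64) {
    // Relaxed: creating a new reference needs no ordering with other memory ops.
    count.fetch_add(1, Ordering::Relaxed);
}

/// Returns true when this was the last reference and the object must drop.
fn rc_dec(count: &AtomicI64) -> bool {
    // Release: publish this thread's writes before giving up the reference.
    if count.fetch_sub(1, Ordering::Release) == 1 {
        // Acquire fence: observe every other thread's Release-decrement
        // before running destructors and freeing the memory.
        fence(Ordering::Acquire);
        return true;
    }
    false
}
```

This is the same Relaxed-increment / Release-decrement / Acquire-fence-before-drop shape used by `Arc` in Rust's standard library.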
Koka’s runtime (kklib) provides the execution support for Koka’s Perceus reference counting. Like Ori, Koka uses a C-compatible runtime with reference counting primitives. Koka’s approach is distinctive in that the compiler generates C code rather than LLVM IR, so the runtime is a C library rather than a Rust crate. Koka also uses a thread-local heap with bump allocation, while Ori uses the system allocator.
CPython’s runtime uses non-atomic reference counting (the GIL provides thread safety). CPython’s Py_INCREF/Py_DECREF are conceptually similar to Ori’s operations but use a different header layout — the refcount is the first field of PyObject, and the type pointer follows. CPython’s cycle detector (for reference cycles in arbitrary object graphs) has no analog in Ori, where value semantics prevent cycles by construction.
Rust has almost no runtime. The Rust standard library provides alloc::alloc and the panic infrastructure, but no reference counting primitives — Arc is a library type with inline operations, not a runtime service. This minimal approach is possible because the borrow checker eliminates the need for runtime memory management. Ori’s runtime is larger because ARC requires runtime support that static ownership analysis does not.
Design Tradeoffs
Rust crate vs C library. Ori’s runtime is written in Rust and compiled as a static library, while Koka’s kklib and Lean 4’s runtime are written in C. The Rust choice provides memory safety within the runtime itself (important when the runtime manipulates raw pointers on behalf of generated code), access to Rust’s standard library for complex operations (sorting, UTF-8 handling, formatting), and the ability to share the runtime as an rlib for JIT mode. The cost is a Rust toolchain dependency for building the runtime.
Atomic vs non-atomic refcounts. The default is atomic operations (AtomicI64), with a single-threaded feature flag for non-atomic mode. The alternative — always non-atomic with a runtime lock for concurrent access — would be simpler but would make concurrent programs pay for lock acquisition on every RC operation. The per-program feature flag lets single-threaded programs avoid atomic overhead entirely, while concurrent programs pay only the atomic operation cost (whose acquire/release ordering is free on x86-64, which already guarantees it for ordinary loads and stores, and maps to lightweight barriers on ARM).
Linear scan vs hash tables for maps. Ori’s map runtime uses linear key scan with an equality callback, not hash tables. This is O(n) per lookup, which is efficient for small maps (the common case) but degrades for large maps. The rationale: hash-based maps would require the runtime to know the key’s hash function, adding another callback parameter to every map operation and complicating the COW protocol. Linear scan keeps the implementation simple and makes the common case (maps with fewer than ~20 entries) fast. A future optimization could add hash-based lookup for maps above a size threshold.
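A sketch of that lookup under the stated assumptions (`find_key` is illustrative; the runtime version receives raw key memory plus a `key_eq` function pointer):

```rust
/// Linear key scan with a caller-supplied equality callback. O(n) per
/// lookup, but with no hashing setup cost: fast for small maps.
fn find_key<K>(keys: &[K], needle: &K, key_eq: fn(&K, &K) -> bool) -> Option<usize> {
    keys.iter().position(|k| key_eq(k, needle))
}
```

Because equality is a callback, the runtime stays type-agnostic; adding hash-based lookup would require threading a hash callback through every map operation as well.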
No auto-shrink. Collections retain their capacity even after elements are removed. This avoids the performance cliff of alternating growth and shrinkage around a capacity boundary (the “ping-pong” problem). The cost is wasted memory for collections that grow large and then shrink. This matches the behavior of Rust’s Vec, Go’s slices, and Java’s ArrayList.
Seamless slices via negative capacity. Rather than introducing a separate slice type, Ori encodes slice state in the capacity field’s sign bit. This keeps the OriList struct at 24 bytes and means slices flow through the same code paths as regular lists (with a branch at the COW decision point). The alternative — a separate OriSlice type — would avoid the branch but require the compiler to track which type each variable holds, complicating the generated code.
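One possible shape of the sign-bit trick (purely illustrative: the actual bit packing in slice_encoding/ may differ, and a real encoding also has to recover the base allocation for RC operations):

```rust
/// Mark a capacity word as "slice" and pack the element offset from the
/// base allocation into it. The +1 keeps offset 0 representable as a
/// strictly negative value.
fn encode_slice_cap(offset: i64) -> i64 {
    debug_assert!(offset >= 0);
    -(offset + 1)
}

/// Owned lists always have cap >= 0, so the sign bit is the discriminant.
fn is_slice(cap: i64) -> bool {
    cap < 0
}

fn decode_offset(cap: i64) -> i64 {
    debug_assert!(is_slice(cap));
    -cap - 1
}
```

The single branch on the sign bit at the COW decision point is the entire cost of avoiding a separate slice type.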
Related Documents
- Reference Counting — RC header layout, atomic operations, synchronization model
- Collections & COW — Copy-on-write mutation protocol, list/map/set operations
- String SSO — Small string optimization, SSO/heap discrimination, COW string operations
- Data Structures — Memory layouts for OriList, OriMap, OriSet, OriStr, iterators
- ARC System — The analysis pass that determines where RC operations are inserted
- LLVM Backend — The code generator that emits calls to runtime functions