Section 05: Register-Based VM
Status: Not Started

Goal: A bytecode execution engine that runs the instruction set from Section 04 in a tight loop. Concrete target: ≤0.5µs per function call on the measured dev machine, with Python 3.12 performance treated as the stretch comparison rather than the only acceptable number. A(3,8) should complete in under 1 second on the measured host (Python: 0.386s).
Context: The bytecode compiler (Section 04) flattens the recursive CanExpr tree into linear instructions. This section builds the engine that executes those instructions. The critical insight: a well-written dispatch loop spends almost all its time in a single function — the VM loop. No recursive calls, no arena lookups, no pattern matching on expression kinds. Just fetch → decode → execute → repeat.
The dispatch loop is the single most performance-critical piece of code in the entire compiler: every nanosecond per dispatch is multiplied across millions of instructions.
Reference implementations:
- Lua 5.4 — `lvm.c`: a single ~3000-line C function `luaV_execute()` with a `switch` on opcodes. Register-based — operands are register indices, not stack positions. The gold standard for interpreter performance.
- CPython 3.12 — `Python/ceval.c`: `_PyEval_EvalFrameDefault()` with computed goto dispatch (`USE_COMPUTED_GOTOS`). Stack-based but with a specializing adaptive interpreter. Our measured target: 0.14µs/call.
- Roc — `src/eval/interpreter.zig`: flat dispatch over instruction tags. Arena-threaded operands.
Depends on: Section 04 (bytecode compilation — provides the Chunk and OpCode types).
05.1 VM Core and Dispatch Loop
File(s): compiler/ori_eval/src/bytecode/vm.rs (new)
The heart of the interpreter. A single function that loops over bytecode instructions and dispatches each one. Every allocation, branch miss, or cache miss in this loop costs millions of cycles over a program’s execution.
- Implement the `VM` struct:

```rust
pub struct VM {
    /// The register file — contiguous pre-allocated storage of Values.
    /// Each call frame gets a window into this array.
    registers: Vec<Value>,
    /// Call frame stack (pre-allocated, never cloned).
    frames: Vec<CallFrame>,
    /// Current frame pointer (index into `frames`).
    frame_count: usize,
    /// Global variables.
    globals: FxHashMap<Name, Value>,
    /// Open upvalues (for closure capture).
    open_upvalues: Vec<UpvalueRef>,
}

struct CallFrame {
    /// The chunk being executed.
    /// SAFETY: raw pointer avoids a lifetime parameter on CallFrame.
    /// Invariant: chunk outlives the frame (enforced by VM::execute taking &Chunk).
    /// Consider a safe alternative: store an index into a chunk table instead.
    chunk: *const Chunk, // borrowed, not owned
    /// Instruction pointer (index into chunk.code).
    ip: usize,
    /// Base register offset in the register file.
    base: usize,
    /// Call metadata for backtraces and runtime diagnostics.
    call_name: Name,
    call_span: Option<Span>,
}
```

- Implement the dispatch loop — the most critical code:

```rust
impl VM {
    pub fn execute(&mut self, chunk: &Chunk) -> EvalResult {
        let mut ip: usize = 0;
        let code = &chunk.code;
        let constants = &chunk.constants;
        let base = self.current_base();
        loop {
            let instr = code[ip];
            ip += 1;
            let op = OpCode::from_byte((instr & 0xFF) as u8);
            let a = ((instr >> 8) & 0xFF) as usize + base;
            let b = ((instr >> 16) & 0xFF) as usize + base;
            let c = ((instr >> 24) & 0xFF) as usize;
            match op {
                OpCode::LoadConst => {
                    // ABx format: [opcode:8][A:8][Bx:16]
                    // A = destination register, Bx = constant pool index
                    let dest = ((instr >> 8) & 0xFF) as usize + base;
                    let bx = ((instr >> 16) & 0xFFFF) as usize;
                    self.registers[dest] = constants[bx].clone();
                }
                OpCode::AddInt => {
                    let lhs = self.registers[b].as_int();
                    let rhs = self.registers[c].as_int();
                    self.registers[a] = Value::Int(lhs + rhs);
                }
                OpCode::Call => {
                    // Push new frame, adjust base, continue
                    // ... (see 05.2)
                }
                OpCode::Return => {
                    let result = std::mem::replace(
                        &mut self.registers[a],
                        Value::Void,
                    );
                    // Pop frame, restore ip and base
                    self.frame_count -= 1;
                    if self.frame_count == 0 {
                        return Ok(result);
                    }
                    let frame = &self.frames[self.frame_count - 1];
                    ip = frame.ip;
                    // Place result in caller's destination register
                    // ...
                }
                OpCode::TailCall => {
                    // Reuse current frame — no push/pop
                    // Reset ip to 0, rebind parameters in-place
                }
                // ... all other opcodes
            }
        }
    }
}
```
- Value representation: The VM uses the same `Value` enum as the tree-walker. This is critical for compatibility — built-in functions, method dispatch, and the pattern system all operate on `Value`. Do NOT invent a separate VM-internal value type. The register file is `Vec<Value>`. This means `Value::clone()` costs (Arc bumps for heap-backed values) still apply, but the elimination of per-call Environment/CallStack allocation is the dominant win.
- Dispatch strategy: Start with `match` (Rust’s switch). Profile. If branch prediction is a bottleneck, evaluate:
  - Computed goto via an unsafe function pointer table (the Rust equivalent of C’s `&&label` extension)
  - Direct threading via `#[cold]` / `#[inline(always)]` hints on rare/common arms
  - Measure: if `match` dispatch costs >2ns/instruction, switch to a function pointer table
- Register file sizing: Pre-allocate `Vec<Value>` with capacity for the max expected nesting. Typical: 256 registers × 64 max frames = 16,384 slots. Grows only on deep recursion.
- Instruction decode: The 32-bit decode (shifts + masks) should compile to 3-4 x86 instructions. Verify with `cargo asm` or `disasm-ori.sh`.
- Benchmark the bare dispatch loop: Measure cycles per instruction dispatch with a no-op program (a tight loop of `Move` instructions). Target: <5ns per dispatch. Use `cargo asm -p ori_eval --lib -- "VM::execute"` to verify the hot loop compiles to a tight switch with no unexpected function calls.
- Function lookup architecture: The VM must resolve `Value::Function(FunctionValue)` to compiled `Chunk`s. The `FunctionValue` carries a `CanId` (body reference) and `SharedCanonResult` (the canon IR). The bytecode compiler produces a `Chunk` per function; the VM needs a way to map `FunctionValue` -> `Chunk` at call time. Options: (a) eagerly compile all functions at module load time and store `Chunk`s in a lookup table keyed by `CanId`; (b) lazily compile on first call and cache in the VM; (c) store the `Chunk` directly in a new `Value::BytecodeFunction` variant. Option (a) is simplest and avoids runtime compilation overhead.
- Carry a separate control-flow stack if needed: The current evaluator uses `ControlAction` (4 variants: `Error`, `Break(Value)`, `Continue(Value)`, `Propagate(Value)`) as Rust `Result::Err` to unwind the Rust call stack. In the bytecode VM there is no Rust recursion to unwind. The VM must implement:
  - Break/Continue with labels: `CanExpr::Break { label, value }` and `CanExpr::Continue { label, value }` can target named loops (`loop:name`, `for:name`). The bytecode compiler must resolve label targets to jump offsets. For `break:name` targeting an outer loop, the VM must unwind any inner loops between the break site and the target loop.
  - Break with value: `break value` exits a `for...yield` loop, appending a final value to the collection. The VM must handle this at the yield collection level.
  - Continue with value for yield: `continue value` in `for...yield` substitutes a value. The VM must pipe this through the yield accumulator.
  - Propagate across frames: The `?` operator emits `Propagate(Value)`, which crosses function call boundaries (caught by `catch_propagation()` at the caller). In the VM, this means popping the current frame and checking the caller’s propagation handler.
  - Error with backtrace: `Error(Box<EvalError>)` carries backtrace data. The VM must be able to capture frame information for backtraces when errors occur.
- Safety decision for `*const Chunk` in `CallFrame`: The sample design uses a raw pointer to avoid a lifetime parameter on `CallFrame`. Evaluate alternatives: (a) a chunk index into a `Vec<Chunk>` (safe, one indirection), (b) `&'a Chunk` with a lifetime on `VM<'a>` (safe, constrains the VM’s lifetime), (c) a raw pointer with a `// SAFETY:` invariant (performance, requires unsafe). The chunk table index (option a) is recommended for correctness-first development; the raw pointer can be introduced later if profiling shows the indirection matters.
- Matrix tests (in `compiler/ori_eval/src/bytecode/vm/tests.rs`):
  - Dimensions: opcode category (load/store, arithmetic, comparison, control flow, call/return) x value type (int, float, bool, str, void) x register count (few: 3, moderate: 20, pressure: 250)
  - Semantic pin: `LoadConst(r0, 42)` followed by `Return(r0)` produces `Value::Int(42)`
  - Negative pin: executing past the end of chunk code produces a clean error (not UB or a panic)
  - TDD ordering: implement the VM struct and dispatch loop skeleton FIRST, write tests for LoadConst/Return/Move, verify they fail (the VM doesn’t execute yet), then implement opcode handlers incrementally
- `/tpr-review` passed — independent review found no critical or major issues (or all findings triaged)
- `/impl-hygiene-review` passed — hygiene review clean. MUST run AFTER `/tpr-review` is clean.
- Subsection close-out (05.1) — MANDATORY before starting the next subsection. Run `/improve-tooling` retrospectively on THIS subsection’s debugging journey (per `.claude/skills/improve-tooling/SKILL.md` “Per-Subsection Workflow”): which `diagnostics/` scripts you ran, where you added `dbg!`/`tracing` calls, where output was hard to interpret, where test failures gave unhelpful messages, where you ran the same command sequence repeatedly. Forward-look: what tool/log/diagnostic would shorten the next regression in this code path by 10 minutes? Implement improvements NOW (zero deferral) and commit each via a SEPARATE `/commit-push` using a valid conventional-commit type (`build(diagnostics): ... — surfaced by section-05.1 retrospective`; `build`/`test`/`chore`/`ci`/`docs` are valid, `tools(...)` is rejected by the lefthook commit-msg hook). Mandatory even when nothing felt painful. If genuinely no gaps, document briefly: “Retrospective 05.1: no tooling gaps”. Update this subsection’s `status` in the section frontmatter to `complete`.
- `/sync-claude` section-close doc sync — verify Claude artifacts across all section commits. Map changed crates to rules files, check CLAUDE.md, canon.md. Fix drift NOW.
- Repo hygiene check — run `diagnostics/repo-hygiene.sh --check` and clean any detected temp files.
05.2 Call Frames and Scoping
File(s): compiler/ori_eval/src/bytecode/vm.rs
Function calls in the bytecode VM are fundamentally different from the tree-walker: no environment allocation, no interpreter construction, no cloning. A call pushes a frame (3 words), adjusts the register window, and jumps to ip=0 of the callee’s chunk.
- Implement the `Call` opcode:

```rust
OpCode::Call => {
    let func_reg = b;
    let arg_start = c;
    let arg_count = /* encoded in instruction or next word */;
    let callee = &self.registers[func_reg];
    let chunk = /* extract compiled chunk from Value::Function / MemoizedFunction */;

    // Save current frame state
    self.frames[self.frame_count - 1].ip = ip;

    // Push new frame
    let new_base = func_reg + 1; // args already in registers after func
    self.frames[self.frame_count] = CallFrame {
        chunk: chunk as *const Chunk,
        ip: 0,
        base: new_base,
    };
    self.frame_count += 1;

    // Switch to callee
    ip = 0;
    // base is now new_base (handled by frame lookup)
}
```
- Implement the `TailCall` opcode: Reuse the current frame. Move arguments into the current frame’s parameter slots, reset ip to 0. This is how Ackermann’s outer recursive call becomes a loop — zero stack growth.

```rust
OpCode::TailCall => {
    // Move args into current frame's parameter slots
    let frame = &mut self.frames[self.frame_count - 1];
    // Copy arg registers to param registers
    // Reset ip to 0
    ip = 0;
    // No frame push — same frame, same base
}
```

Caveat: Ackermann uses multi-clause dispatch (`@ack (0: int, n: int)`, `@ack (m: int, 0: int)`, `@ack (m: int, n: int)`). In the current evaluator, multi-clause dispatch compiles to a `DecisionTree` evaluated at each call entry. For tail calls to work, the VM must re-enter the decision tree at ip=0, not just the body of the current clause. This means TailCall targets the function entry point (including pattern matching), not the clause body. The outer call `ack(m: m - 1, n: ack(m:, n: n - 1))` is NOT a tail call (the inner `ack` must complete first), but `ack(m: m - 1, n: 1)` in clause 2 IS a tail call.
- Implement the `Return` opcode: Pop the frame, place the result in the caller’s destination register, restore ip.
- Implement scoping for `let` bindings: Local variables are registers. `let x = expr` compiles to instructions that place the result in a specific register. No scope push/pop — register allocation handles it at compile time.
- Implement closure upvalue access: `GetUpvalue`/`SetUpvalue` access the closure’s captured values. These are stored in the closure object, not in a scope chain.
- Stack overflow protection: Check `frame_count < MAX_FRAMES` on every `Call`. If exceeded, return a clean error (not a Rust stack overflow).
- Matrix tests (in `compiler/ori_eval/src/bytecode/vm/tests.rs`):
  - Dimensions: call type (regular, tail, mutual recursion, closure-to-closure) x depth (shallow: 5, medium: 100, deep: 1000+, overflow: max_frames+1) x return path (normal, error propagation, break across frames)
  - Semantic pin: Ackermann A(3,5) = 253 via the bytecode VM (identical to tree-walker result)
  - Semantic pin (tail call): tail-recursive countdown(1000000) completes with frame_count never exceeding 2 (verified by instrumentation)
  - Negative pin: `Call` when `frame_count >= MAX_FRAMES` returns a `StackOverflow` error (not a Rust panic)
  - TDD ordering: write Call/Return/TailCall tests FIRST, verify they fail, then implement frame management
- `/tpr-review` passed — independent review found no critical or major issues (or all findings triaged)
- `/impl-hygiene-review` passed — hygiene review clean. MUST run AFTER `/tpr-review` is clean.
- Subsection close-out (05.2) — MANDATORY before starting the next subsection. Run `/improve-tooling` retrospectively on THIS subsection’s debugging journey (per `.claude/skills/improve-tooling/SKILL.md` “Per-Subsection Workflow”): which `diagnostics/` scripts you ran, where you added `dbg!`/`tracing` calls, where output was hard to interpret, where test failures gave unhelpful messages, where you ran the same command sequence repeatedly. Forward-look: what tool/log/diagnostic would shorten the next regression in this code path by 10 minutes? Implement improvements NOW (zero deferral) and commit each via a SEPARATE `/commit-push` using a valid conventional-commit type (`build(diagnostics): ... — surfaced by section-05.2 retrospective`; `build`/`test`/`chore`/`ci`/`docs` are valid, `tools(...)` is rejected by the lefthook commit-msg hook). Mandatory even when nothing felt painful. If genuinely no gaps, document briefly: “Retrospective 05.2: no tooling gaps”. Update this subsection’s `status` in the section frontmatter to `complete`.
- `/sync-claude` section-close doc sync — verify Claude artifacts across all section commits. Map changed crates to rules files, check CLAUDE.md, canon.md. Fix drift NOW.
- Repo hygiene check — run `diagnostics/repo-hygiene.sh --check` and clean any detected temp files.
05.3 Method Dispatch and Built-in Integration
File(s): compiler/ori_eval/src/bytecode/vm.rs, compiler/ori_eval/src/bytecode/builtins.rs (new)
The VM must integrate with Ori’s existing method dispatch system (MethodDispatcher with priority chain: UserRegistry → CollectionMethod → BuiltinMethod) and built-in functions (print, assert_eq, etc.).
- Built-in function calls: `FunctionVal` values (registered in `function_val.rs`) are called via a fast path — direct Rust function pointer invocation, no bytecode dispatch:

```rust
OpCode::Call => {
    match &self.registers[func_reg] {
        Value::FunctionVal(func_ptr, _) => {
            // Direct call to Rust function — no frame push
            let args = &self.registers[arg_start..arg_start + arg_count];
            let result = func_ptr(args)?;
            self.registers[dest] = result;
        }
        Value::Function(_) | Value::MemoizedFunction(_) => {
            // Bytecode call — push frame (as in 05.2)
        }
    }
}
```
- Method dispatch integration: The `MethodCall` opcode must preserve the full current dispatch chain, which is more complex than just the three-resolver priority chain. The actual dispatch order in `eval_method_call()` is:
  - Print methods (pre-interned Name comparison for `print`/`println`)
  - Associated functions on `TypeRef` receivers (user registry + derived + builtin)
  - Callable struct field access (a struct field that is a function value)
  - MethodDispatcher chain: UserRegistry (priority 0) -> Collection (priority 1) -> Builtin (priority 2)

  Additionally, `dispatch_method_call()` (called from `can_eval.rs`) handles `ModuleNamespace` receivers before even calling `eval_method_call()`.

```rust
OpCode::MethodCall => {
    let receiver = &self.registers[b];
    let method_name = constants[c].as_name();
    // Must route through the full dispatch chain, not a simplified version
    let result = self.eval_method_call(receiver.clone(), method_name, args)?;
    self.registers[a] = result;
}
```
- Keep the tree-walker path as the semantic oracle during the transition: first route complex method/iterator/protocol behavior through existing helpers, then inline or specialize only after parity is proven.
- TPR checkpoint after 05.2: Run `/tpr-review` after call frames and scoping are implemented. This validates the core VM execution model (dispatch + frames) before adding the complexity of method dispatch and built-in integration in 05.3. All findings must be resolved before proceeding.
- Iterator protocol: The `ForPrepare`/`ForLoop` opcodes must work with Ori’s `IteratorValue` system. The iterator state lives in a register; `ForLoop` calls `next()` and branches on exhaustion.
- Error handling: The `Propagate` opcode implements `?` — check whether the value is Err/None and return early if so. This must interact correctly with the frame stack (pop frames on propagation).
- Capability propagation: Capabilities (`uses Random`, etc.) are looked up by name. The VM needs access to the current capability bindings. Store them in a side table indexed by frame, or pass them through the global scope.
- Pattern matching in `match` expressions: If Section 04 compiles `DecisionTree` directly, the VM must support the emitted branch/binding structure rather than assuming a simple `TestTag` chain. `DecisionTree` has 4 variants: `Switch` (branch on test kind — variant tag, literal equality, type test), `Leaf` (bind variables via `ScrutineePath` and execute the arm body), `Guard` (evaluate the guard expression, fall through to `on_fail` if false), and `Fail` (unreachable).
- `FunctionExp` dispatch: `CanExpr::FunctionExp` handles 15 special forms (`Recurse`, `Parallel`, `Spawn`, `Timeout`, `Cache`, `With`, `Print`, `Panic`, `Catch`, `Todo`, `Unreachable`, `Channel`, `ChannelIn`, `ChannelOut`, `ChannelAll`). These are currently dispatched inline in `can_eval.rs`. The VM should route these to helper functions rather than implementing each as a dedicated opcode. The `Recurse` pattern in particular has complex semantics (tail call optimization, memoization, base/step functions, parallel mode) that should remain in Rust helper code.
- Matrix tests (in `compiler/ori_eval/src/bytecode/vm/tests.rs` and spec tests):
  - Dimensions: dispatch target (FunctionVal builtin, user method, derived method, collection method, builtin method) x receiver type (int, str, [int], {str: int}, struct, Option, Result) x FunctionExp kind (Print, Panic, Catch, Recurse — the 4 most common of the 15 variants)
  - Semantic pin: `[1, 2, 3].map(x -> x * 2)` via the VM produces `[2, 4, 6]` (identical to tree-walker)
  - Semantic pin: `catch(expr: { panic(msg: "boom") })` via the VM produces `Err("boom")` (identical to tree-walker)
  - Negative pin: calling a non-function value via the `Call` opcode produces a `not_callable` error (same error as tree-walker)
  - TDD ordering: write builtin call tests FIRST (print, assert_eq), then method dispatch tests, then iterator tests, then FunctionExp tests
- `/tpr-review` passed — independent review found no critical or major issues (or all findings triaged)
- `/impl-hygiene-review` passed — hygiene review clean. MUST run AFTER `/tpr-review` is clean.
- Subsection close-out (05.3) — MANDATORY before starting the next subsection. Run `/improve-tooling` retrospectively on THIS subsection’s debugging journey (per `.claude/skills/improve-tooling/SKILL.md` “Per-Subsection Workflow”): which `diagnostics/` scripts you ran, where you added `dbg!`/`tracing` calls, where output was hard to interpret, where test failures gave unhelpful messages, where you ran the same command sequence repeatedly. Forward-look: what tool/log/diagnostic would shorten the next regression in this code path by 10 minutes? Implement improvements NOW (zero deferral) and commit each via a SEPARATE `/commit-push` using a valid conventional-commit type (`build(diagnostics): ... — surfaced by section-05.3 retrospective`; `build`/`test`/`chore`/`ci`/`docs` are valid, `tools(...)` is rejected by the lefthook commit-msg hook). Mandatory even when nothing felt painful. If genuinely no gaps, document briefly: “Retrospective 05.3: no tooling gaps”. Update this subsection’s `status` in the section frontmatter to `complete`.
- `/sync-claude` section-close doc sync — verify Claude artifacts across all section commits. Map changed crates to rules files, check CLAUDE.md, canon.md. Fix drift NOW.
- Repo hygiene check — run `diagnostics/repo-hygiene.sh --check` and clean any detected temp files.
05.R Third Party Review Findings
- None.
05.N Completion Checklist
- Ackermann A(3,8) = 2045 completes in <1 second (Python: 0.386s, Ori current: timeout)
- `cargo bench -p oric --bench interpreter -- ackermann_3_5` shows per-call cost ≤0.5µs (down from 63µs)
- All built-in functions accessible from bytecode (print, assert_eq, panic, etc.)
- Method dispatch works for all collection types
- Iterator protocol works (for…yield produces correct results)
- Error propagation (?) works across bytecode frames
- VM standalone tests pass: comprehensive Rust unit/integration tests covering all opcodes, all CanExpr variants, control flow, closures, iterators, method dispatch, and error handling
- All existing Rust tests pass: `cargo t` green (tree-walker unaffected)
- `./test-all.sh` green (tree-walker path unchanged; VM tested via its own test suite)
- Note: Full spec test parity via `cargo st` requires Section 07 runtime integration across both entry points:
  - `ori run` / direct evaluation (`evaluated()` / `run_evaluation()`)
  - `ori test` interpreter backend (`TestRunner` + `Evaluator::load_module(...)`)
- Stack overflow produces clean error, not Rust panic
- Plan annotation cleanup: `bash .claude/skills/impl-hygiene-review/plan-annotations.sh --plan 05` returns 0 annotations
- `/tpr-review` passed
- `/impl-hygiene-review` passed
- `/improve-tooling` retrospective completed — MANDATORY at section close, after both reviews are clean. Reflect on the section’s debugging journey (which `diagnostics/` scripts you ran, which command sequences you repeated, where you added ad-hoc `dbg!`/`tracing` calls, where output was hard to interpret) and identify any tool/log/diagnostic improvement that would have made this section materially easier OR that would help the next section touching this area. Implement every accepted improvement NOW (zero deferral) and commit each via a SEPARATE `/commit-push`. The retrospective is mandatory even when nothing felt painful — that is exactly when blind spots accumulate. See `.claude/skills/improve-tooling/SKILL.md` “Retrospective Mode” for the full protocol.
Exit Criteria: Ackermann A(3,8) completes in under 1 second via the bytecode VM in isolation (standalone VM test, not yet through Salsa pipeline). Per-call cost measured at ≤0.5µs by Criterion benchmarks. All built-in functions and method dispatch work. The VM is a complete, tested execution engine ready for Salsa integration in Section 07. Full Salsa pipeline integration and dual-execution parity testing happen in Sections 07 and 06 respectively.