Interpreter Performance Engineering
7 sections
Overview
Make Ori's interpreter as close to native execution speed as possible. Function call overhead is currently ~63µs/call (measured with an Ackermann benchmark), roughly 100-600x slower than a register-based bytecode VM such as Lua's. This plan turns the tree-walking interpreter into a high-performance bytecode VM through incremental, independently testable phases. Tree-walker hot-path work (Sections 02-03) is worth doing only while benchmarks show that allocation and clone churn still dominate; the critical path is preserving evaluator semantics while moving execution to bytecode.
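A per-call figure like the one above is total elapsed time divided by the number of calls the workload makes. The sketch below shows that arithmetic on a native Rust Ackermann function. It is an illustration only: the names `ackermann` and `measure` are hypothetical, and a real measurement would run an Ori Ackermann program through the interpreter instead of calling Rust directly.

```rust
use std::time::Instant;

/// Ackermann function that also counts how many calls were made.
/// Stand-in workload: the real benchmark would evaluate an Ori program.
fn ackermann(m: u64, n: u64, calls: &mut u64) -> u64 {
    *calls += 1;
    if m == 0 {
        n + 1
    } else if n == 0 {
        ackermann(m - 1, 1, calls)
    } else {
        let inner = ackermann(m, n - 1, calls);
        ackermann(m - 1, inner, calls)
    }
}

/// Runs the workload once and returns (result, call count, microseconds per call).
fn measure(m: u64, n: u64) -> (u64, u64, f64) {
    let mut calls = 0;
    let start = Instant::now();
    let result = ackermann(m, n, &mut calls);
    let elapsed = start.elapsed();
    // Per-call overhead = total elapsed time / number of calls.
    let us_per_call = elapsed.as_secs_f64() * 1e6 / calls as f64;
    (result, calls, us_per_call)
}

fn main() {
    let (result, calls, us_per_call) = measure(2, 3);
    println!("ackermann(2, 3) = {result} in {calls} calls, {us_per_call:.4} µs/call");
}
```

A single timed run like this is noisy. A statistical harness such as Criterion, which Section 01 plans to use, repeats the workload and reports a distribution rather than one sample.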
Planned
Benchmark Infrastructure
Establish reproducible interpreter performance measurement with Criterion benchmarks and gate tests
Zero-Allocation Call Path
Eliminate all heap allocations from the interpreter function call hot path — target 0 mallocs per call
Value Passing Optimization
Eliminate unnecessary Value clones in parameter binding, self-binding, and capture binding
Bytecode Compilation
Compile CanExpr IR to a register-based bytecode instruction set — eliminate recursive tree-walking dispatch
Register-Based VM
Execute bytecode in a tight dispatch loop with a contiguous register file — target ≤0.5µs/call on the measured dev machine, with Python 3.12 as the stretch comparison
Verification
Prove the bytecode VM is correct (identical results to the tree-walker) and fast (meets the ≤0.5µs/call target, with Python 3.12 as the stretch comparison), with permanent regression guards
Salsa Integration & Transition
Integrate the bytecode VM into the runtime execution pipeline, preserve the tree-walker for const-eval/pattern execution, and provide a feature flag for gradual rollout
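The bytecode, VM, and verification sections can be pictured together with a toy example: compile an expression tree to register-based instructions, run them in a flat dispatch loop, and check the result against the tree-walker. This is a minimal sketch under stated assumptions, not Ori's design. `Expr` is a hypothetical stand-in for CanExpr, `Op` is an invented three-operand instruction set, and the register allocation is the simplest scheme that works (result in `dst`, right operand in `dst + 1`).

```rust
/// Tiny expression tree standing in for the real IR (hypothetical shape).
enum Expr {
    Int(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

/// Reference semantics: recursive tree-walking evaluation.
fn eval_tree(e: &Expr) -> i64 {
    match e {
        Expr::Int(v) => *v,
        Expr::Add(a, b) => eval_tree(a).wrapping_add(eval_tree(b)),
        Expr::Mul(a, b) => eval_tree(a).wrapping_mul(eval_tree(b)),
    }
}

/// Register-based instructions: operands are indices into the register file.
#[derive(Clone, Copy)]
enum Op {
    LoadInt { dst: usize, value: i64 },
    Add { dst: usize, lhs: usize, rhs: usize },
    Mul { dst: usize, lhs: usize, rhs: usize },
    Return { src: usize },
}

/// Emits code that leaves the value of `e` in register `dst`.
/// `max_reg` tracks the highest register index touched.
fn compile(e: &Expr, dst: usize, code: &mut Vec<Op>, max_reg: &mut usize) {
    *max_reg = (*max_reg).max(dst);
    match e {
        Expr::Int(v) => code.push(Op::LoadInt { dst, value: *v }),
        Expr::Add(a, b) => {
            compile(a, dst, code, max_reg);
            compile(b, dst + 1, code, max_reg);
            code.push(Op::Add { dst, lhs: dst, rhs: dst + 1 });
        }
        Expr::Mul(a, b) => {
            compile(a, dst, code, max_reg);
            compile(b, dst + 1, code, max_reg);
            code.push(Op::Mul { dst, lhs: dst, rhs: dst + 1 });
        }
    }
}

/// Compiles a whole expression; returns the code and the register count it needs.
fn compile_program(e: &Expr) -> (Vec<Op>, usize) {
    let mut code = Vec::new();
    let mut max_reg = 0;
    compile(e, 0, &mut code, &mut max_reg);
    code.push(Op::Return { src: 0 });
    (code, max_reg + 1)
}

/// Executes bytecode in a single loop over a contiguous register file.
fn run(code: &[Op], num_regs: usize) -> i64 {
    let mut regs = vec![0i64; num_regs];
    let mut pc = 0;
    loop {
        let op = code[pc];
        pc += 1;
        match op {
            Op::LoadInt { dst, value } => regs[dst] = value,
            Op::Add { dst, lhs, rhs } => regs[dst] = regs[lhs].wrapping_add(regs[rhs]),
            Op::Mul { dst, lhs, rhs } => regs[dst] = regs[lhs].wrapping_mul(regs[rhs]),
            Op::Return { src } => return regs[src],
        }
    }
}

/// Sample input: (2 + 3) * (4 + 5).
fn sample() -> Expr {
    let add = |a, b| Expr::Add(Box::new(Expr::Int(a)), Box::new(Expr::Int(b)));
    Expr::Mul(Box::new(add(2, 3)), Box::new(add(4, 5)))
}

fn main() {
    let e = sample();
    let (code, num_regs) = compile_program(&e);
    // Differential check: the VM must agree with the tree-walker.
    assert_eq!(run(&code, num_regs), eval_tree(&e));
    println!("both evaluators agree: {}", eval_tree(&e));
}
```

The assertion in `main` is the core of a differential test: the tree-walker serves as the oracle, and any program on which the two evaluators disagree is a VM bug. A permanent regression guard would run the same comparison over a whole test corpus rather than one expression.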