The Problem
Compilers are pipelines. Source code flows through a lexer, parser, type checker, intermediate representations, optimizations, and finally code generation. Each phase transforms the program, and each transformation can introduce subtle issues.
Unit tests and end-to-end tests both check for specific scenarios — known inputs with expected outputs. They answer "does this case work?" But a compiler can pass every test and still produce suboptimal code. A type checker might produce correct types that lead to redundant ARC operations. A canonicalization pass might restructure code in ways that confuse the LLVM backend. No one wrote a test for these cases because no one knew to look.
Manual review can find these issues — but it takes hours per program and doesn't scale as the compiler evolves.
The Solution
A Code Journey is exploratory, not prescriptive. It takes a piece of Ori code and walks it through every compiler phase — from source text to final binary — with AI examining the output at each stage. There's no predefined checklist. The AI looks at what the compiler actually produced and reports what it finds.
Source → Lexer → Parser → Type Check → Canon → ARC → Eval / LLVM

Think of it like walking through a forest — not searching for a specific tree, but noticing what's there. A redundant instruction here. A missing attribute there. An ARC increment that could have been elided. These aren't things you'd write a test for, because you didn't know they existed until you looked.
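The walk itself can be sketched as a loop over phases. This is a hypothetical driver — the phase names mirror the pipeline above, but `run_phase` and `analyze` are invented stand-ins (the latter for the AI examining each phase's output), not Ori's real API:

```python
# Hypothetical sketch of a journey driver: walk one program through every
# compiler phase and collect open-ended observations at each stage.
PHASES = ["lexer", "parser", "typecheck", "canon", "arc", "llvm"]

def run_journey(source, run_phase, analyze):
    """run_phase(name, artifact) -> next artifact; analyze(name, artifact) -> findings."""
    findings = []
    artifact = source
    for phase in PHASES:
        artifact = run_phase(phase, artifact)      # transform the program
        findings.extend(analyze(phase, artifact))  # note anything interesting
    return artifact, findings
```

The point of the shape: analysis happens at every stage, not just on the final binary, so cross-phase issues have somewhere to surface.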
What Makes It Unique
vs Unit / E2E Tests
Tests verify specific scenarios — known inputs with expected outputs. They confirm what you already thought about. Journeys discover things you didn't know to look for: cross-phase inefficiencies, missed optimizations, subtle codegen issues that no one wrote a test case for.
vs Fuzzing
Fuzzers search for crashes — inputs that break things. Journeys search for quality. The compiler works, but is the output good? Are there unnecessary reference counting operations? Missing function attributes? A fuzzer would never flag these because nothing crashed.
vs Manual Review
An expert reviewing LLVM IR by hand does the same kind of open-ended exploration — but it takes hours per function and doesn't repeat itself. A journey covers the same ground in minutes with structured scoring, and runs again after every compiler change.
The Scoring System
Every journey is scored across 7 dimensions, each rated on a 10-point scale. Scores are computed from measurable metrics — instruction ratios, violation counts, compliance percentages — fed through a deterministic scoring script. The AI counts things from the generated IR; the script maps those counts to scores via strict threshold tables. Same inputs always produce the same scores.
How It Works
The scoring system separates three concerns: exploration (open-ended AI analysis of the compiler output), measurement (extracting countable metrics from the analysis), and scoring (deterministic mapping from metrics to numbers). The exploration stays genuinely open-ended — the rubric scores what was found, it doesn't constrain what to look for.
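As a minimal sketch of the scoring stage — the metric name and threshold values here are illustrative, not Ori's actual rubric — a deterministic threshold table might look like:

```python
# Illustrative deterministic scorer: a counted metric in, a 0-10 score out.
# The metric name and threshold bounds below are made up for this sketch.
# Entries are (max count, score), checked in order, so the same count
# always maps to the same score.
REDUNDANT_RC_THRESHOLDS = [(0, 10), (1, 7), (3, 4)]

def score_redundant_rc(count):
    for bound, score in REDUNDANT_RC_THRESHOLDS:
        if count <= bound:
            return score
    return 0  # anything beyond the last bound scores zero
```

For example, zero redundant pairs would score 10, one would score 7, and anything past the last threshold bottoms out at 0 — no judgment call in the mapping itself.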
The 7 Dimensions
Attributes — `nounwind`, `fastcc`, `noreturn`. Each applicable attribute is checked for presence.

Gate Conditions
Critical issues cap the maximum score regardless of other metrics. No amount of clean IR can compensate for fundamentally wrong output.
- Binary crash or wrong output — Binary Quality = 0, overall capped at 3.0
- RC on scalar types — ARC Correctness = 0 (fundamentally wrong)
- Unbalanced RC pairs — ARC Correctness capped at 3 (leak or double-free)
- Wrong attributes applied — Attributes capped at 2 (worse than missing)
- Incorrect control flow — Control Flow capped at 1
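The capping rules above can be sketched as a post-processing step over the per-dimension scores. The flag and dimension names mirror the list; how Ori actually aggregates an overall score is an assumption here:

```python
# Sketch of gate conditions: critical flags cap scores regardless of metrics.
# Flag names mirror the list above; the mean-based overall is an assumption.
def apply_gates(scores, flags):
    """scores: dict of dimension -> 0..10; flags: set of critical issue names."""
    s = dict(scores)
    overall_cap = 10.0
    if "binary_crash_or_wrong_output" in flags:
        s["binary_quality"] = 0
        overall_cap = 3.0
    if "rc_on_scalar" in flags:
        s["arc_correctness"] = 0          # fundamentally wrong
    if "unbalanced_rc_pairs" in flags:
        s["arc_correctness"] = min(s.get("arc_correctness", 10), 3)
    if "wrong_attributes" in flags:
        s["attributes"] = min(s.get("attributes", 10), 2)
    if "incorrect_control_flow" in flags:
        s["control_flow"] = min(s.get("control_flow", 10), 1)
    overall = min(sum(s.values()) / len(s), overall_cap)
    return s, overall
```

Because caps are applied after metric scoring, clean IR elsewhere cannot lift a journey past a gate — which is exactly the stated intent.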
What It Catches
Journeys have found real issues that conventional tests missed:
- Redundant ARC operations — an `rc_inc` immediately followed by an `rc_dec` on the same value, invisible to phase-local tests but obvious when tracing the full pipeline
- Missing function attributes — a function that never throws but isn't marked `nounwind`, preventing LLVM from optimizing call sites
- Suboptimal block layout — an `if/else` compiled into three basic blocks where two would suffice
- Unnecessary instructions — a return value computed and stored to a stack slot, then immediately loaded back, where a direct register return would work
- Unchecked negation overflow — integer negation missing overflow guards for `INT_MIN`, caught by cross-referencing the type checker's knowledge with the generated IR
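A finding like the first one — an `rc_inc` immediately followed by an `rc_dec` on the same value — is the kind of pattern that, once discovered, can be mechanized as a scan over a textual IR dump. The `rc_inc %v` / `rc_dec %v` syntax below is invented for illustration:

```python
import re

# Flag an rc_inc immediately followed by an rc_dec on the same value.
# The "rc_inc %v" / "rc_dec %v" syntax is a stand-in for the real IR.
RC_OP = re.compile(r"\b(rc_inc|rc_dec)\s+(%\w+)")

def redundant_rc_pairs(ir_lines):
    pairs = []
    prev = None  # (line index, op, value) of the previous RC instruction
    for i, line in enumerate(ir_lines):
        m = RC_OP.search(line)
        if m and prev and prev[1] == "rc_inc" and m.group(1) == "rc_dec" \
                and prev[2] == m.group(2):
            pairs.append((prev[0], i))  # (inc line, dec line)
        prev = (i, m.group(1), m.group(2)) if m else None
    return pairs
```

This is how journeys tend to pay off: an open-ended observation becomes a cheap, repeatable check that runs after every compiler change.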
See It in Action
Each journey is a self-contained deep dive into one piece of Ori code. Pick a journey that interests you and follow the code from source to binary.
Explore the Journeys