What is a Code Journey?

A new kind of compiler testing — AI traces your code through every phase and scores the output.

Every journey starts with a piece of code that says "I am a..." — I am arithmetic. I am a closure. I am a match. The code is the protagonist, and the journey follows it through every compiler phase to see what happens along the way.

The Problem

Compilers are pipelines. Source code flows through a lexer, parser, type checker, intermediate representations, optimizations, and finally code generation. Each phase transforms the program, and each transformation can introduce subtle issues.

Unit tests and end-to-end tests both check for specific scenarios — known inputs with expected outputs. They answer "does this case work?" But a compiler can pass every test and still produce suboptimal code. A type checker might produce correct types that lead to redundant ARC operations. A canonicalization pass might restructure code in ways that confuse the LLVM backend. No one wrote a test for these cases because no one knew to look.

Manual review can find these issues — but it takes hours per program and doesn't scale as the compiler evolves.

The Solution

A Code Journey is exploratory, not prescriptive. It takes a piece of Ori code and walks it through every compiler phase — from source text to final binary — with AI examining the output at each stage. There's no predefined checklist. The AI looks at what the compiler actually produced and reports what it finds.

Source → Lexer → Parser → Type Check → Canon → ARC → Eval / LLVM

Think of it like walking through a forest — not searching for a specific tree, but noticing what's there. A redundant instruction here. A missing attribute there. An ARC increment that could have been elided. These aren't things you'd write a test for, because you didn't know they existed until you looked.

What Makes It Unique

vs Unit / E2E Tests

Tests verify specific scenarios — known inputs with expected outputs. They confirm what you already thought about. Journeys discover things you didn't know to look for: cross-phase inefficiencies, missed optimizations, subtle codegen issues that no one wrote a test case for.

vs Fuzzing

Fuzzers search for crashes — inputs that break things. Journeys search for quality. The compiler works, but is the output good? Are there unnecessary reference counting operations? Missing function attributes? A fuzzer would never flag these because nothing crashed.

vs Manual Review

An expert reviewing LLVM IR by hand does the same kind of open-ended exploration — but it takes hours per function and isn't repeatable. A journey covers the same ground in minutes with structured scoring, and runs again after every compiler change.

The Scoring System

Every journey is scored across 7 dimensions, each rated on a 10-point scale. Scores are computed from measurable metrics — instruction ratios, violation counts, compliance percentages — fed through a deterministic scoring script. The AI counts things from the generated IR; the script maps those counts to scores via strict threshold tables. Same inputs always produce the same scores.

How It Works

The scoring system separates three concerns: exploration (open-ended AI analysis of the compiler output), measurement (extracting countable metrics from the analysis), and scoring (deterministic mapping from metrics to numbers). The exploration stays genuinely open-ended — the rubric scores what was found, it doesn't constrain what to look for.
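The measurement-to-scoring step can be pictured as a small deterministic function. The sketch below is illustrative only: the metric, threshold values, and function names are hypothetical, not Ori's actual scoring script.

```python
# Hypothetical sketch of deterministic threshold-table scoring.
# Table values and the metric shown are illustrative, not Ori's real rubric.

def score_from_table(value: float, table: list[tuple[float, int]]) -> int:
    """Map a measured metric to a score via a strict threshold table.

    `table` holds (upper_bound, score) pairs sorted ascending; the first
    bound the value does not exceed wins. Same input, same score.
    """
    for upper_bound, score in table:
        if value <= upper_bound:
            return score
    return 0  # worse than every threshold

# Example metric: instruction-efficiency ratio = actual / ideal instruction count.
EFFICIENCY_TABLE = [(1.0, 10), (1.25, 8), (1.5, 6), (2.0, 4), (3.0, 2)]

print(score_from_table(3 / 3, EFFICIENCY_TABLE))  # ideal match -> 10
print(score_from_table(7 / 3, EFFICIENCY_TABLE))  # 7 vs 3 ideal -> 2
```

Because the AI only supplies the counts and the table does the mapping, re-running a journey on unchanged compiler output cannot change the score.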

The 7 Dimensions

  • Instruction Efficiency (15%) — Ratio of actual to ideal instructions. A function with 7 instructions where 3 would suffice scores lower than one matching the ideal exactly.
  • ARC Correctness (20%) — Reference counting violations: unbalanced pairs (leaks), wasted pairs (overhead), missing borrow elision, scalar RC ops (bugs).
  • Attributes & Safety (10%) — LLVM function attributes like nounwind, fastcc, noreturn. Each applicable attribute checked for presence.
  • Control Flow (10%) — Empty basic blocks, redundant branches, trivial phi nodes, unreachable code — structural defects in the generated IR.
  • IR Quality (20%) — Each function compared against hand-written ideal IR. Extra instructions must be justified (overflow checking, ABI) or they count against the score.
  • Binary Quality (10%) — Does the binary produce correct output? Does it crash? Does the interpreter agree with the compiled binary?
  • Other Findings (15%) — Discoveries that don't fit the 6 defined categories — the escape valve that keeps analysis genuinely open-ended.
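The weights above sum to 100%, so the overall score is a weighted average of the per-dimension scores. A minimal sketch of that combination — the dictionary keys, function name, and rounding are assumptions, not the real script's API:

```python
# Weights taken from the rubric above; the combination logic is a
# hypothetical sketch, not Ori's actual scoring script.
WEIGHTS = {
    "instruction_efficiency": 0.15,
    "arc_correctness": 0.20,
    "attributes_safety": 0.10,
    "control_flow": 0.10,
    "ir_quality": 0.20,
    "binary_quality": 0.10,
    "other_findings": 0.15,
}

def overall_score(scores: dict[str, float]) -> float:
    """Weighted average of the 7 dimension scores (each on a 0-10 scale)."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights cover 100%
    return round(sum(WEIGHTS[d] * scores[d] for d in WEIGHTS), 2)

perfect = {d: 10.0 for d in WEIGHTS}
print(overall_score(perfect))                              # 10.0
print(overall_score({**perfect, "arc_correctness": 0.0}))  # 8.0
```

Note how the 20% dimensions (ARC Correctness, IR Quality) move the overall score twice as far as the 10% ones.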

Gate Conditions

Critical issues cap the maximum score regardless of other metrics. No amount of clean IR can compensate for fundamentally wrong output.

  • Binary crash or wrong output — Binary Quality = 0, overall capped at 3.0
  • RC on scalar types — ARC Correctness = 0 (fundamentally wrong)
  • Unbalanced RC pairs — ARC Correctness capped at 3 (leak or double-free)
  • Wrong attributes applied — Attributes capped at 2 (worse than missing)
  • Incorrect control flow — Control Flow capped at 1

Score Scale

  • 9-10: Excellent
  • 7-8: Good
  • 5-6: Fair
  • 1-4: Needs work
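The gate rules above amount to a small post-processing step over the dimension scores. A sketch under assumed names — the finding flags and score keys are illustrative, not the real implementation:

```python
# Hypothetical sketch of gate conditions: critical findings cap scores
# no matter how clean the rest of the IR is.
def apply_gates(scores: dict[str, float],
                findings: set[str]) -> tuple[dict[str, float], float]:
    """Return capped per-dimension scores and a cap on the overall score."""
    s = dict(scores)
    overall_cap = 10.0
    if "binary_crash_or_wrong_output" in findings:
        s["binary_quality"] = 0.0
        overall_cap = 3.0                                   # overall capped at 3.0
    if "rc_on_scalar" in findings:
        s["arc_correctness"] = 0.0                          # fundamentally wrong
    if "unbalanced_rc_pairs" in findings:
        s["arc_correctness"] = min(s["arc_correctness"], 3.0)  # leak / double-free
    if "wrong_attributes" in findings:
        s["attributes_safety"] = min(s["attributes_safety"], 2.0)
    if "incorrect_control_flow" in findings:
        s["control_flow"] = min(s["control_flow"], 1.0)
    return s, overall_cap

capped, cap = apply_gates(
    {"binary_quality": 9.0, "arc_correctness": 8.0,
     "attributes_safety": 7.0, "control_flow": 6.0},
    {"binary_crash_or_wrong_output", "unbalanced_rc_pairs"},
)
print(capped["binary_quality"], capped["arc_correctness"], cap)  # 0.0 3.0 3.0
```

Applying caps after scoring (rather than blending them in) keeps the rule visible: a crashing binary scores at most 3.0 even if every other dimension is perfect.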

What It Catches

Journeys have found real issues that conventional tests missed:

  • Redundant ARC operations — an rc_inc immediately followed by rc_dec on the same value, invisible to phase-local tests but obvious when tracing the full pipeline
  • Missing function attributes — a function that never throws left without the nounwind attribute, preventing LLVM from optimizing its call sites
  • Suboptimal block layout — an if/else compiled into three basic blocks where two would suffice
  • Unnecessary instructions — a return value computed and stored to a stack slot, then immediately loaded back — where a direct register return would work
  • Checked negation overflow — integer negation missing overflow guards for INT_MIN, caught by cross-referencing the type checker's knowledge with the generated IR
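The first finding above — an rc_inc immediately followed by an rc_dec on the same value — is the kind of pattern that becomes trivial to spot once the whole pipeline output is in view. A toy detector over a hypothetical textual IR with `rc_inc(%v)` / `rc_dec(%v)` instructions (the IR syntax here is assumed for illustration, not Ori's actual format):

```python
import re

# Toy scan for a redundant ARC pair: rc_inc on a value immediately
# followed by rc_dec on the same value. IR syntax is hypothetical.
RC_OP = re.compile(r"\brc_(inc|dec)\((%\w+)\)")

def redundant_rc_pairs(ir_lines: list[str]) -> list[int]:
    """Return indices of rc_inc lines whose next line is rc_dec of the same value."""
    ops = [RC_OP.search(line) for line in ir_lines]
    hits = []
    for i in range(len(ops) - 1):
        a, b = ops[i], ops[i + 1]
        if (a and b and a.group(1) == "inc" and b.group(1) == "dec"
                and a.group(2) == b.group(2)):
            hits.append(i)
    return hits

ir = ["%x = alloc_string()", "rc_inc(%x)", "rc_dec(%x)", "ret %x"]
print(redundant_rc_pairs(ir))  # [1]
```

A journey finds this class of issue by reading the IR rather than running pattern matchers, but the example shows why it is invisible to phase-local tests: neither the inc nor the dec is wrong on its own.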

See It in Action

Each journey is a self-contained deep dive into one piece of Ori code. Pick a journey that interests you and follow the code from source to binary.

Explore the Journeys