
Section 12: Verification & Benchmarks

Context: Representation optimization is uniquely dangerous because bugs manifest as silent data corruption rather than crashes. If a value is narrowed incorrectly, the program produces wrong results without any error. The only way to catch this is exhaustive comparison between optimized and unoptimized paths.

Reference implementations:

  • Rust crater: Tests compiler changes against the entire crates.io ecosystem
  • LLVM test-suite: Standardized benchmarks for measuring optimization impact
  • Go test/bench/: Microbenchmarks for individual operations + macrobenchmarks for real programs

Depends on: ALL sections (this is the final verification).


12.1 Test Matrix

Build a comprehensive test matrix covering every optimization through the full pipeline.

  • §02 Transitive Triviality:

    • Option<int> — trivial (no RC)
    • Option<str> — non-trivial (RC)
    • (int, bool, float) — trivial
    • (int, str) — non-trivial
    • Result<int, Ordering> — trivial
    • Result<int, str> — non-trivial
    • Nested: Option<Option<int>> — trivial
    • Recursive struct — non-trivial
    • Struct with all scalar fields — trivial
    • Newtype type UserId = int — trivial (resolves through Named)
    • Newtype type Name = str — non-trivial
    • FFI type CPtr — trivial (opaque pointer, no RC)
    • Generic Pair<int> — trivial vs Pair<str> — non-trivial
    • All primitive tags covered (exhaustive — 12 variants)
    • Iterator/DoubleEndedIterator — trivial (Box-allocated, no RC header)
  • §04 Integer Narrowing:

    • Loop counter for i in 0..100 → i8
    • Struct field with bounded constructor → narrowed
    • Function parameter with single call site → narrowed
    • Public function parameter → NOT narrowed
    • Arithmetic on narrowed values → correct overflow handling
    • Cross-module function call → correct ABI widening
  • §05 Float Narrowing:

    • Constant 0.5 in struct → f32 storage
    • Arithmetic result → NOT narrowed (conservative)
    • Mixed f32-storage + f64-arithmetic → correct fpext/fptrunc
  • §06 Struct Layout:

    • { bool, int, bool } → reordered (int first)
    • { int, int } → unchanged (already optimal)
    • #repr("c") struct → declaration order preserved
    • #repr("transparent") newtype → same layout as inner field
    • #repr("aligned", 16) struct → alignment ≥ 16
    • #repr("packed") struct → no padding, alignment = 1
    • Tuple (bool, int, bool) → reordered
  • §07 Enum Repr:

    • Option<bool> → 1 byte (niche)
    • Option<Ordering> → 1 byte (niche)
    • Option<str> → 24 bytes (null data-ptr niche, no tag field — same size as str itself)
    • All-unit enum → tag only
    • Single-variant enum → newtype erasure
  • §08 Escape Analysis:

    • Temporary list → stack-promoted
    • Returned list → NOT stack-promoted
    • Closure-captured value → escapes
    • Borrowed parameter → no escape through callee
  • §09 ARC Header:

    • Non-escaping alloc → no header
    • Bounded sharing → narrow header
    • Unbounded sharing → full i64 header
  • §10 Thread-Local ARC:

    • Single-threaded program → all non-atomic
    • Multi-threaded with channel → shared values atomic, local values non-atomic
  • §11 Collections:

    • Short string → SSO
    • Long string → heap
    • Empty list → inline
    • Large list → heap
    • [bool] → packed
  • /tpr-review passed — independent review found no critical or major issues (or all findings triaged)

  • /impl-hygiene-review passed — hygiene review clean. MUST run AFTER /tpr-review is clean.

  • Subsection close-out (12.1) — MANDATORY before starting the next subsection. Run /improve-tooling retrospectively on THIS subsection’s debugging journey (per .claude/skills/improve-tooling/SKILL.md “Per-Subsection Workflow”): which diagnostics/ scripts you ran, where you added dbg!/tracing calls, where output was hard to interpret, where test failures gave unhelpful messages, where you ran the same command sequence repeatedly. Forward-look: what tool/log/diagnostic would shorten the next regression in this code path by 10 minutes? Implement improvements NOW (zero deferral) and commit each via SEPARATE /commit-push using a valid conventional-commit type (e.g. build(diagnostics): ... — surfaced by section-12.1 retrospective; build/test/chore/ci/docs are valid, tools(...) is rejected by the lefthook commit-msg hook). Mandatory even when nothing felt painful. If genuinely no gaps, document briefly: “Retrospective 12.1: no tooling gaps”. Update this subsection’s status in section frontmatter to complete.

  • /sync-claude section-close doc sync — verify Claude artifacts across all section commits. Map changed crates to rules files, check CLAUDE.md, canon.md. Fix drift NOW.

  • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files.


12.2 Dual-Execution Equivalence

Verify that optimized code produces identical results to unoptimized code.

  • Extend ./diagnostics/dual-exec-verify.sh to compare:

    • (a) Interpreter (eval) vs AOT with all optimizations
    • (b) AOT without optimizations vs AOT with all optimizations
    • (c) Interpreter (eval) vs AOT without optimizations (sanity check for the bypass path itself)
    • Modes (b) and (c) require a flag to disable representation optimization
    • Keep the script’s existing interfaces ([test-path], --main-only, --test-only, --json) working; --compare-repr-opt is an extension, not a rewrite
  • Add --no-repr-opt flag to ori build:

    • Skips all §02-§11 optimizations
    • Uses canonical representations (current behavior)
    • Produces reference output for comparison
  • Run comparison on ALL spec tests:

    ./diagnostics/dual-exec-verify.sh --compare-repr-opt tests/
    • Every test must produce bit-identical output (same values, same ordering)
    • Float comparisons must also be bit-identical — no ULP tolerance. The §05 narrowing guarantee is “zero precision loss”, so any output difference indicates a narrowing bug or a printing bug, both of which must be caught. (If a future optimization allows lossy narrowing via opt-in #repr("f32"), those specific tests can use ULP tolerance, but the default must be exact.)
  • Run comparison on benchmark programs:

    • tests/benchmarks/bench_small.ori
    • tests/benchmarks/bench_medium.ori
    • tests/benchmarks/ (all)
    • Results must match exactly (bit-identical for both integer and float benchmarks, consistent with the zero-precision-loss guarantee in §05)
  • /tpr-review passed — independent review found no critical or major issues (or all findings triaged)

  • /impl-hygiene-review passed — hygiene review clean. MUST run AFTER /tpr-review is clean.

  • Subsection close-out (12.2) — MANDATORY before starting the next subsection. Run /improve-tooling retrospectively on THIS subsection’s debugging journey (per .claude/skills/improve-tooling/SKILL.md “Per-Subsection Workflow”): which diagnostics/ scripts you ran, where you added dbg!/tracing calls, where output was hard to interpret, where test failures gave unhelpful messages, where you ran the same command sequence repeatedly. Forward-look: what tool/log/diagnostic would shorten the next regression in this code path by 10 minutes? Implement improvements NOW (zero deferral) and commit each via SEPARATE /commit-push using a valid conventional-commit type (e.g. build(diagnostics): ... — surfaced by section-12.2 retrospective; build/test/chore/ci/docs are valid, tools(...) is rejected by the lefthook commit-msg hook). Mandatory even when nothing felt painful. If genuinely no gaps, document briefly: “Retrospective 12.2: no tooling gaps”. Update this subsection’s status in section frontmatter to complete.

  • /sync-claude section-close doc sync — verify Claude artifacts across all section commits. Map changed crates to rules files, check CLAUDE.md, canon.md. Fix drift NOW.

  • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files.


12.3 Memory Safety Verification

  • Valgrind (heap memory):

    ./diagnostics/valgrind-aot.sh tests/valgrind/
    • All existing Valgrind tests must pass
    • Add new Valgrind tests for:
      • SSO string operations (inline → heap transitions)
      • SVO list operations (inline → heap transitions)
      • Stack-promoted values with RC’d fields
      • Narrowed struct fields with padding
      • Niche-filled enum pattern matching
  • Create diagnostics/asan-test.sh (new script — required for AddressSanitizer testing):

    • Build ori with AddressSanitizer: RUSTFLAGS="-Zsanitizer=address" cargo +nightly build --target x86_64-unknown-linux-gnu
    • Run the spec test suite with the ASan-enabled binary
    • Run the Valgrind test suite programs through the ASan binary
    • Exit non-zero if any ASan report fires (exit code check)
    • Follow the pattern of valgrind-aot.sh: support --no-color flag, print summary at end
    • Document the environment restriction: this is a nightly + Linux/x86_64 workflow; if unavailable, the script must fail clearly rather than silently skipping
  • AddressSanitizer (stack memory):

    ./diagnostics/asan-test.sh
    • Stack-promoted values must not be accessed after function return
    • No buffer overflows in packed bool arrays
    • No out-of-bounds in narrow-element collections
  • Stress tests:

    • Create 10M small allocations → stress RC header compression
    • Create 10K threads sharing values → stress atomic/non-atomic boundary
    • Deeply nested Option<Option<...<int>>> → stress niche filling
    • 100MB packed bool array → stress packed operations
  • /tpr-review passed — independent review found no critical or major issues (or all findings triaged)

  • /impl-hygiene-review passed — hygiene review clean. MUST run AFTER /tpr-review is clean.

  • Subsection close-out (12.3) — MANDATORY before starting the next subsection. Run /improve-tooling retrospectively on THIS subsection’s debugging journey (per .claude/skills/improve-tooling/SKILL.md “Per-Subsection Workflow”): which diagnostics/ scripts you ran, where you added dbg!/tracing calls, where output was hard to interpret, where test failures gave unhelpful messages, where you ran the same command sequence repeatedly. Forward-look: what tool/log/diagnostic would shorten the next regression in this code path by 10 minutes? Implement improvements NOW (zero deferral) and commit each via SEPARATE /commit-push using a valid conventional-commit type (e.g. build(diagnostics): ... — surfaced by section-12.3 retrospective; build/test/chore/ci/docs are valid, tools(...) is rejected by the lefthook commit-msg hook). Mandatory even when nothing felt painful. If genuinely no gaps, document briefly: “Retrospective 12.3: no tooling gaps”. Update this subsection’s status in section frontmatter to complete.

  • /sync-claude section-close doc sync — verify Claude artifacts across all section commits. Map changed crates to rules files, check CLAUDE.md, canon.md. Fix drift NOW.

  • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files.


12.4 Performance Benchmarks

  • Baseline (before optimizations):

    # NOTE: perf-baseline.sh currently emits a human-readable table, not JSON.
    # Either add a --json flag to perf-baseline.sh, or create perf-compare.sh
    # to parse the existing table format. Using .txt extension to match current output.
    ./scripts/perf-baseline.sh --release > baseline.txt
    • Current baseline coverage is only bench_hello, bench_small, and bench_medium; add the string/struct/ARC-heavy programs before using the broader target table below as a release gate
  • Create scripts/perf-compare.sh (new script — required for baseline comparison):

    • Takes two perf-baseline.sh output files as arguments
    • Parses the table format from perf-baseline.sh (human-readable, not JSON)
    • Reports per-benchmark delta, geometric mean improvement/regression, and highlights any metric exceeding its threshold
    • Exits non-zero if any threshold is violated (per the targets table in §12.4)
    • Follow the pattern of cow-benchmark.sh --compare for argument parsing and output formatting
  • Post-optimization measurement:

    ./scripts/perf-baseline.sh --release > optimized.txt
    ./scripts/perf-compare.sh baseline.txt optimized.txt
  • Metrics to track:

    Metric                     Measurement                      Target
    Compile time               time ori build                   < 10% regression
    AOT binary size            ls -la output                    ≤ 5% increase from extra codepaths
    Runtime (bench_medium)     time ./bench_medium              ≥ 10% improvement
    Runtime (string-heavy)     time ./string_bench              ≥ 30% improvement (SSO)
    Memory (struct-heavy)      peak RSS                         ≥ 20% reduction (narrowing)
    Memory (collection-heavy)  peak RSS                         ≥ 30% reduction (SVO + SSO)
    RC operations              grep ori_rc output.ll | wc -l    ≥ 40% fewer (triviality + escape)
  • Microbenchmarks (add to compiler/oric/benches/):

    • repr_narrowing: Measure ReprPlan computation time
    • range_analysis: Measure range analysis time per function
    • escape_analysis: Measure escape analysis time per function
    • rc_atomic_vs_nonatomic: Measure RC operation throughput
  • Macrobenchmarks (add to tests/benchmarks/):

    • string_processing.ori: Short string manipulation (SSO benefit)
    • data_structures.ori: Small struct creation/destruction (narrowing benefit)
    • option_heavy.ori: Option manipulation (niche benefit)
    • arc_heavy.ori: Many small heap allocations (header compression benefit)
  • /tpr-review passed — independent review found no critical or major issues (or all findings triaged)

  • /impl-hygiene-review passed — hygiene review clean. MUST run AFTER /tpr-review is clean.

  • Subsection close-out (12.4) — MANDATORY before starting the next subsection. Run /improve-tooling retrospectively on THIS subsection’s debugging journey (per .claude/skills/improve-tooling/SKILL.md “Per-Subsection Workflow”): which diagnostics/ scripts you ran, where you added dbg!/tracing calls, where output was hard to interpret, where test failures gave unhelpful messages, where you ran the same command sequence repeatedly. Forward-look: what tool/log/diagnostic would shorten the next regression in this code path by 10 minutes? Implement improvements NOW (zero deferral) and commit each via SEPARATE /commit-push using a valid conventional-commit type (e.g. build(diagnostics): ... — surfaced by section-12.4 retrospective; build/test/chore/ci/docs are valid, tools(...) is rejected by the lefthook commit-msg hook). Mandatory even when nothing felt painful. If genuinely no gaps, document briefly: “Retrospective 12.4: no tooling gaps”. Update this subsection’s status in section frontmatter to complete.

  • /sync-claude section-close doc sync — verify Claude artifacts across all section commits. Map changed crates to rules files, check CLAUDE.md, canon.md. Fix drift NOW.

  • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files.


12.5 Code Journeys

Run /code-journey to test the full pipeline end-to-end with progressively complex programs.

  • Run /code-journey with programs exercising each optimization:

    • Journey 1: Simple narrowing (loop counter, struct field)
    • Journey 2: SSO strings (creation, concat, slice)
    • Journey 3: Escape analysis (temporary values, closures)
    • Journey 4: Thread-local ARC (single-threaded vs multi-threaded)
    • Journey 5: Combined (program using all optimizations together)
  • All CRITICAL findings triaged (fixed or tracked)

  • Eval and AOT paths produce identical results for all journeys

  • Journey results archived in plans/code-journeys/

  • /tpr-review passed — independent review found no critical or major issues (or all findings triaged)

  • /impl-hygiene-review passed — hygiene review clean. MUST run AFTER /tpr-review is clean.

  • Subsection close-out (12.5) — MANDATORY before starting the next subsection. Run /improve-tooling retrospectively on THIS subsection’s debugging journey (per .claude/skills/improve-tooling/SKILL.md “Per-Subsection Workflow”): which diagnostics/ scripts you ran, where you added dbg!/tracing calls, where output was hard to interpret, where test failures gave unhelpful messages, where you ran the same command sequence repeatedly. Forward-look: what tool/log/diagnostic would shorten the next regression in this code path by 10 minutes? Implement improvements NOW (zero deferral) and commit each via SEPARATE /commit-push using a valid conventional-commit type (e.g. build(diagnostics): ... — surfaced by section-12.5 retrospective; build/test/chore/ci/docs are valid, tools(...) is rejected by the lefthook commit-msg hook). Mandatory even when nothing felt painful. If genuinely no gaps, document briefly: “Retrospective 12.5: no tooling gaps”. Update this subsection’s status in section frontmatter to complete.

  • /sync-claude section-close doc sync — verify Claude artifacts across all section commits. Map changed crates to rules files, check CLAUDE.md, canon.md. Fix drift NOW.

  • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files.


12.6 Documentation

  • Update CLAUDE.md with:

    • ori_repr crate description and key paths
    • MachineRepr enum variant summary
    • ReprPlan query interface and tracing
    • ReprAttribute enum and #repr attribute interaction
    • New runtime functions (ori_rc_*_nonatomic, ori_rc_*_i8/i16/i32, SSO, SVO)
    • --no-repr-opt flag documentation
    • ori_repr dependency chain: ori_types → ori_arc → ori_repr → ori_llvm
  • Update spec (docs/ori_lang/v2026/spec/annex-e-system-considerations.md):

    • Mark implemented optimizations as “implemented” vs “future”
    • Update the built-in type representations table to match the current runtime baseline before adding new optimizations:
      • str is the 24-byte SSO-capable form, not { len, data }
      • Option<T> / Result<T, E> currently lower with i64 tags in LLVM, not i8
      • RC-managed heap values currently use the V5 32-byte header (data_size, elem_dec_fn, elem_count, strong_count)
    • Add SSO/SVO to the built-in type representations table
  • Update .claude/rules/ with:

    • ReprPlan query patterns
    • How to add new representation optimizations
  • Update plans/repr-opt/00-overview.md with final metrics

  • /tpr-review passed — independent review found no critical or major issues (or all findings triaged)

  • /impl-hygiene-review passed — hygiene review clean. MUST run AFTER /tpr-review is clean.

  • Subsection close-out (12.6) — MANDATORY before starting the next subsection. Run /improve-tooling retrospectively on THIS subsection’s debugging journey (per .claude/skills/improve-tooling/SKILL.md “Per-Subsection Workflow”): which diagnostics/ scripts you ran, where you added dbg!/tracing calls, where output was hard to interpret, where test failures gave unhelpful messages, where you ran the same command sequence repeatedly. Forward-look: what tool/log/diagnostic would shorten the next regression in this code path by 10 minutes? Implement improvements NOW (zero deferral) and commit each via SEPARATE /commit-push using a valid conventional-commit type (e.g. build(diagnostics): ... — surfaced by section-12.6 retrospective; build/test/chore/ci/docs are valid, tools(...) is rejected by the lefthook commit-msg hook). Mandatory even when nothing felt painful. If genuinely no gaps, document briefly: “Retrospective 12.6: no tooling gaps”. Update this subsection’s status in section frontmatter to complete.

  • /sync-claude section-close doc sync — verify Claude artifacts across all section commits. Map changed crates to rules files, check CLAUDE.md, canon.md. Fix drift NOW.

  • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files.


12.7 Completion Checklist

Scripts created in this section:

  • diagnostics/asan-test.sh created and functional (builds ASan binary, runs spec+valgrind tests, exits non-zero on ASan report)
  • scripts/perf-compare.sh created and functional (parses perf-baseline.sh output, reports deltas, exits non-zero on threshold violations)
  • tests/valgrind/threads/ directory created with thread_local_only.ori and channel_send.ori

Verification:

  • Test matrix covers all §02-§11 features (every checkbox in 12.1)
  • Dual-execution verification: 0 mismatches across all spec tests
  • Valgrind: 0 errors across all Valgrind tests (old + new)
  • AddressSanitizer: 0 errors (via diagnostics/asan-test.sh)
  • Helgrind: 0 races in threading test programs (via ./diagnostics/valgrind-aot.sh --helgrind tests/valgrind/threads/)
  • Stress tests pass (10M allocations, 10K threads, 100MB packed array)
  • Performance baselined with before/after comparison (via scripts/perf-compare.sh)
  • No compile-time regression > 10%
  • Runtime improvement ≥ 10% on benchmark suite
  • Memory reduction ≥ 20% on struct-heavy benchmarks
  • RC operation count reduced ≥ 40% on typical programs
  • Code journeys pass — eval/AOT match
  • All documentation updated
  • ./test-all.sh green
  • ./clippy-all.sh green
  • /tpr-review passed — independent Codex review found no critical or major issues (or all findings triaged)
  • /impl-hygiene-review passed — implementation hygiene review clean (phase boundaries, SSOT, algorithmic DRY, naming). MUST run AFTER /tpr-review is clean.
  • /improve-tooling retrospective completed — MANDATORY at section close, after both reviews are clean. Reflect on the section’s debugging journey (which diagnostics/ scripts you ran, which command sequences you repeated, where you added ad-hoc dbg!/tracing calls, where output was hard to interpret) and identify any tool/log/diagnostic improvement that would have made this section materially easier OR that would help the next section touching this area. Implement every accepted improvement NOW (zero deferral) and commit each via SEPARATE /commit-push. The retrospective is mandatory even when nothing felt painful — that is exactly when blind spots accumulate. See .claude/skills/improve-tooling/SKILL.md “Retrospective Mode” for the full protocol.

Exit Criteria: Running ./scripts/perf-compare.sh baseline.txt optimized.txt shows:

  • Runtime: geometric mean ≥ 10% improvement across benchmark suite
  • Memory: geometric mean ≥ 20% reduction across benchmark suite
  • RC operations: ≥ 40% fewer in generated LLVM IR
  • Correctness: 0 mismatches in dual-execution, 0 Valgrind errors
  • All commands: ./test-all.sh, ./clippy-all.sh, ./llvm-test.sh green

12.R Third Party Review Findings

  • [TPR-12-001][minor] section-12-verification.md:127-131. No interpreter vs AOT-unoptimized sanity test. §12.2 specifies (a) interpreter vs AOT-optimized and (b) AOT-unoptimized vs AOT-optimized. Missing: (c) interpreter vs AOT with --no-repr-opt as a sanity check that the bypass flag itself works correctly. If codegen changes in §04/§06/§07 accidentally break the unoptimized path, comparison (b) would produce false positives. Action: Add comparison (c), interpreter vs AOT-unoptimized, to the dual-execution verification matrix.