s17 — Compile-Cost Baselines + Fast-Tier Throughput Gates

Goal

Compile cost is measured before it can be traded away: one reproducible harness records wall time, CPU time, peak RSS, phase invocations, artifact size, and finalization/link time at both backend scope and end-to-end scope. This section closes the fast-tier beat-Cranelift gate and freezes the paired LLVM -O3 baselines that s19/s20 must beat rather than merely match for output quality.

Implementation Sketch

Tooling: extend compiler_repo/diagnostics/ with a reusable backend-compile-bench driver plus Criterion benches in oric/benches/. The driver owns process isolation, corpus generation, cache-state preparation, peak-RSS collection, same-machine paired runs, schema validation, and checked-in ledger regeneration; no step of the benchmark procedure is sequenced by hand in chat.
Phase telemetry: translate/optimize/isel/regalloc/encode/fragment-write/finalize/link invocation counts and durations, total CPU, wall time, peak RSS, cache hits/misses, and output size. Backend-scope (realized IR to finalized artifact) and end-to-end (source to linked artifact) remain separate columns; neither may substitute for the other.
Corpus: hello/small/medium, generated large modules with many independent functions, generated large individual functions, and the rosetta + benchmark programs. The schema reserves cold, warm-no-change, one-function-body-edit, ABI/layout-change, typed-fact-change, jobs=1, and jobs=N cases; s17 records baseline rows and s17B fills the cache/parallel cases.
Paired baselines: (a) Cranelift at its corresponding no-opt/fast setting over equivalent workload shapes for the fast-tier backend comparison; (b) LLVM -O3 compiling the exact same Ori source with the same target, runtime, and linker, and a matching output checksum, for the optimizing-tier baseline. Pin compiler SHAs, target/features, host/runner class, corpus hash, and methodology version. Unsupported or drifted environments report unavailable, never pass.
Statistics and fairness: warm up explicitly, run samples sequentially, randomize pair order, retain raw readings, and compute medians plus a one-sided upper confidence bound through /calc. “Faster” means the upper confidence bound for the native/reference wall-time ratio is strictly below 1.0; equality and noise overlap both produce blocker rows. RSS uses isolated-process peak readings and requires a native/reference ratio no greater than 1.0.
Ratchets: per-phase budgets derive from measured gaps; the CI regression gate follows the cow-benchmark precedent and carries injected-slowdown and stale-baseline sensitivity tests. The fast-tier standings contain zero blocker rows only when every size class beats Cranelift; LLVM -O3 rows remain pending inputs until s20.
Optimization work this section: only structural compile-speed fixes surfaced by telemetry (allocation churn, re-walks) — the architecture (SoA, recycling, single-pass) already encodes the speed design from s01/s03.
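
Illustration: the "faster" rule from the Statistics and fairness item can be sketched as below. This is a minimal sketch under stated assumptions, not the harness API. The section does not name the bound method (it routes the calculation through /calc), so the distribution-free order-statistic bound, the names RowClass, upper_bound_on_median, and classify_row, the alpha parameter, and the choice to report too-small samples as unavailable are all hypothetical.

```rust
/// Classification of one comparison row. `Unavailable` never counts as a pass.
#[derive(Debug, PartialEq)]
pub enum RowClass {
    Pass,
    Blocker,
    Unavailable,
}

/// One-sided upper confidence bound on the median of `values` (assumed method):
/// the smallest order statistic x_(k) with P(Binomial(n, 1/2) >= k) <= alpha.
/// Returns None when the sample is too small to support the requested alpha.
pub fn upper_bound_on_median(values: &[f64], alpha: f64) -> Option<f64> {
    let n = values.len();
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mut pmf = 0.5f64.powi(n as i32); // P(X = n)
    let mut tail = 0.0; // becomes P(X >= k) inside the loop
    let mut best = None;
    for k in (1..=n).rev() {
        tail += pmf;
        if tail > alpha {
            break;
        }
        best = Some(k);
        // Step the probability mass from P(X = k) to P(X = k - 1).
        pmf *= k as f64 / (n - k + 1) as f64;
    }
    best.map(|k| sorted[k - 1])
}

/// Classify one paired row from per-sample wall times (same units, same order)
/// and isolated-process peak RSS readings in bytes.
pub fn classify_row(
    native_wall: &[f64],
    reference_wall: &[f64],
    native_peak_rss: u64,
    reference_peak_rss: u64,
    alpha: f64,
) -> RowClass {
    if native_wall.len() != reference_wall.len() {
        return RowClass::Unavailable;
    }
    // Paired native/reference ratios; a ratio below 1.0 means native was faster.
    let ratios: Vec<f64> = native_wall
        .iter()
        .zip(reference_wall)
        .map(|(n, r)| n / r)
        .collect();
    if ratios.iter().any(|r| !r.is_finite() || *r <= 0.0) {
        return RowClass::Unavailable;
    }
    match upper_bound_on_median(&ratios, alpha) {
        // Too few samples to form a bound: assumed to be unavailable, not a pass.
        None => RowClass::Unavailable,
        // Strictly below 1.0 on time, and RSS ratio no greater than 1.0.
        Some(bound) if bound < 1.0 && native_peak_rss <= reference_peak_rss => RowClass::Pass,
        // Equality, noise overlap, or RSS inflation all land here.
        Some(_) => RowClass::Blocker,
    }
}
```

With ten paired samples and alpha = 0.05 this sketch uses the ninth-smallest ratio as the bound; with fewer than five samples no bound exists at that alpha, which the sketch reports as unavailable.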

Test Strategy

Bench determinism: measurement samples run sequentially to avoid cross-sample CPU contention; jobs=N is concurrency inside the measured compiler, not concurrent benchmark processes. Repeated regeneration on the fixed runner preserves row classification.
Sensitivity pins: injected phase slowdown, injected RSS inflation, checksum mismatch, missing comparison row, and stale corpus/toolchain/methodology fingerprints each force a blocker or unavailable result.
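
Illustration: the sensitivity pins can be read as a gate in which every injected fault is forced away from Pass. The sketch below assumes one possible mapping. Only "drifted environments report unavailable" comes from the Paired baselines item; treating checksum mismatch and a missing comparison row as blockers, and every type, field, and function name here (Fingerprint, Row, GateResult, gate), are hypothetical.

```rust
/// Pinned identity of a measurement environment. The field set follows the
/// Paired baselines item; the concrete types are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Fingerprint {
    pub compiler_sha: String,
    pub target_features: String,
    pub runner_class: String,
    pub corpus_hash: String,
    pub methodology_version: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum GateResult {
    Pass,
    Blocker(&'static str),
    Unavailable(&'static str),
}

/// Inputs for one comparison row, already reduced to booleans for time and RSS.
#[derive(Clone, Copy)]
pub struct Row<'a> {
    pub pinned: &'a Fingerprint,
    pub observed: &'a Fingerprint,
    pub native_checksum: Option<&'a str>,
    pub reference_checksum: Option<&'a str>,
    pub time_bound_below_one: bool,
    pub rss_ratio_at_most_one: bool,
}

pub fn gate(row: &Row<'_>) -> GateResult {
    // Stale corpus/toolchain/methodology pins: unavailable, never pass.
    if row.pinned != row.observed {
        return GateResult::Unavailable("stale or drifted fingerprint");
    }
    // Both sides must exist and agree on the output checksum (assumed blockers).
    match (row.native_checksum, row.reference_checksum) {
        (Some(n), Some(r)) if n == r => {}
        (Some(_), Some(_)) => return GateResult::Blocker("checksum mismatch"),
        _ => return GateResult::Blocker("missing comparison row"),
    }
    // An injected phase slowdown should flip this flag.
    if !row.time_bound_below_one {
        return GateResult::Blocker("wall-time bound not below 1.0");
    }
    // An injected RSS inflation should flip this flag.
    if !row.rss_ratio_at_most_one {
        return GateResult::Blocker("peak RSS ratio above 1.0");
    }
    GateResult::Pass
}
```

A sensitivity test in this style starts from a passing row, injects exactly one fault, and asserts that the result is no longer Pass.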

Work Items

Reusable backend-compile-bench driver + per-phase telemetry/Criterion family over the throughput corpus, reporting backend/end-to-end time, CPU, peak RSS, invocation counts, artifact size, finalization, and link separately.
Paired Cranelift-fast and LLVM -O3 baseline harnesses + fairness/statistics methodology + checked-in raw/derived artifacts with regeneration command and /calc-verified classifications.
CI compile-cost regression gate (cow-benchmark precedent) + injected time/RSS/checksum failures + stale corpus/toolchain/methodology pins.
Structural speed fixes surfaced by telemetry landed; every size-class fast-vs-Cranelift blocker closed and standings recorded, while paired LLVM -O3 compile-time/RSS rows are frozen as s20 inputs.