s20 — Beat O3 Without Paying O3: Quality + Compile-Cost Gates

Goal

The mission’s core empirical claim becomes one paired per-program ledger: native optimizing-tier output runs at least as fast as LLVM -O3 output, while the native compiler compiles strictly faster at both backend and end-to-end scope and uses no more peak memory. Gap-closing rounds must improve this Pareto surface; generated-code wins cannot buy permission to recreate LLVM’s optimization-time pathology.

Implementation Sketch

Promotion-ledger infrastructure extends the s17 schema. Each per-program row contains: checksums; native and LLVM -O3 generated-program runtime; runtime score = LLVM runtime / native runtime (higher is better); backend-scope and end-to-end compile wall/CPU time, each with a native/LLVM ratio (lower is better); peak RSS ratio (native/LLVM, lower is better); artifact size; optimizer budget use/exhaustion; finalization/link share; raw-sample references; confidence bounds; and blocker reasons. All ratios and classifications come from the regeneration tooling and /calc, never hand-entered.
Corpus: tests/benchmarks (incl. cow/, aims_burden/) + the rosetta implemented set (grows as rosetta-stress-test progresses; the ledger consumes whatever exists per round) + the Rust reference twins where present.
Per-row promotion conditions: checksum match; runtime score >= 1.0; the upper one-sided confidence bound on each native/LLVM compile-time ratio (backend-scope and end-to-end) is strictly below 1.0; peak RSS ratio <= 1.0; no missing measurement/fingerprint; and no unclassified optimizer-budget event. Equality or confidence overlap on compile latency remains a blocker because the mission is to remove, not equal, LLVM’s compile-time tax.
Gap-closing ratchet rounds rank both runtime and compile-cost blockers. Profile the worst blocker (perf/disasm plus compiler phase/RSS tooling in diagnostics); classify generated-code causes (missing pass, isel, schedule, regalloc, runtime overhead) separately from compile-cost causes (repeated scan, IR growth, cache miss, serialization, link overhead). Route fixes to s19, s17B, or the relevant subsystem and remeasure. A runtime fix that creates a compile-time/RSS blocker or widens invalidation is not a monotonic improvement and cannot land in the promoted roster.
GCC stretch: where a GCC toolchain target overlaps (Linux x86-64/aarch64/s390x/riscv64), record GCC -O3 columns for the C-twin programs (rosetta C++ baselines) as stretch evidence. These columns are informational, not promotion-blocking.
CI Pareto-regression gate: once a row passes, generated-program runtime, backend/end-to-end compile latency, and RSS classifications are locked. Changing optimizer fuel/growth limits, pass roster/order, cache-key schema, default jobs, or corpus/toolchain fingerprints forces full paired regeneration; a stale ledger cannot bless the change.
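
The per-row promotion conditions above can be sketched as a pure predicate that returns blocker reasons instead of a bare pass/fail. This is a minimal illustration, not the ledger's actual schema: every field name and blocker string is hypothetical, the sample values are made up, and the ratios are assumed to have already been computed by the regeneration tooling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LedgerRow:
    # All field names are hypothetical; the real schema extends s17.
    program: str
    checksum_match: bool
    runtime_score: Optional[float]        # LLVM runtime / native runtime
    backend_ratio_upper: Optional[float]  # upper one-sided bound, native/LLVM
    e2e_ratio_upper: Optional[float]      # upper one-sided bound, native/LLVM
    peak_rss_ratio: Optional[float]       # native/LLVM
    fingerprint_present: bool = True
    unclassified_budget_events: int = 0


def blockers(row: LedgerRow) -> List[str]:
    """Return every reason the row cannot be promoted; empty means it passes."""
    reasons = []
    if not row.checksum_match:
        reasons.append("checksum-mismatch")
    measured = (row.runtime_score, row.backend_ratio_upper,
                row.e2e_ratio_upper, row.peak_rss_ratio)
    if any(m is None for m in measured) or not row.fingerprint_present:
        reasons.append("missing-measurement-or-fingerprint")
    if row.runtime_score is not None and row.runtime_score < 1.0:
        reasons.append("runtime-slower-than-llvm")
    # Strict inequality: a bound of exactly 1.0 (or above) still blocks.
    if row.backend_ratio_upper is not None and row.backend_ratio_upper >= 1.0:
        reasons.append("backend-compile-not-strictly-faster")
    if row.e2e_ratio_upper is not None and row.e2e_ratio_upper >= 1.0:
        reasons.append("e2e-compile-not-strictly-faster")
    if row.peak_rss_ratio is not None and row.peak_rss_ratio > 1.0:
        reasons.append("peak-rss-above-llvm")
    if row.unclassified_budget_events > 0:
        reasons.append("unclassified-budget-event")
    return reasons


# Illustrative values only, not measurements.
good = LedgerRow("fib", True, 1.08, 0.41, 0.63, 0.87)
tie = LedgerRow("fib", True, 1.00, 1.00, 0.63, 1.00)
print(blockers(good))  # []
print(blockers(tie))   # ['backend-compile-not-strictly-faster']
```

The `tie` row shows the asymmetry between the conditions: equality is acceptable for runtime score and peak RSS, but a compile-time bound that only reaches 1.0 still blocks promotion.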

Test Strategy

The ledger IS the test artifact. Sensitivity pins inject generated-program slowdown, compiler-phase slowdown, IR explosion/RSS inflation, missing cache dependency, checksum mismatch, confidence-overlap, and stale fingerprints; each must create the correct blocker rather than silently disappear.
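
A sensitivity pin can be pictured as a known-good row plus one injected fault and the exact blocker that fault must produce. The harness below is a sketch under stated assumptions: `classify` is a stand-in for the real ledger checker, all keys, values, and blocker names are hypothetical, and the missing-cache-dependency pin is left out because this toy model has no cache.

```python
import copy

# Hypothetical passing row; keys and values are illustrative only.
BASELINE = {
    "checksum_match": True,
    "runtime_score": 1.10,        # LLVM runtime / native runtime
    "compile_ratio_upper": 0.60,  # upper confidence bound, native/LLVM
    "peak_rss_ratio": 0.90,       # native/LLVM
    "fingerprint": "abc",
    "expected_fingerprint": "abc",
}


def classify(row):
    """Stand-in for the ledger checker: returns the set of blocker names."""
    found = set()
    if not row["checksum_match"]:
        found.add("checksum")
    if row["runtime_score"] < 1.0:
        found.add("runtime")
    if row["compile_ratio_upper"] >= 1.0:
        found.add("compile-latency")
    if row["peak_rss_ratio"] > 1.0:
        found.add("rss")
    if row["fingerprint"] != row["expected_fingerprint"]:
        found.add("stale-fingerprint")
    return found


# Each pin injects one fault and names the blocker it must produce.
PINS = [
    ("generated-program slowdown", {"runtime_score": 0.80}, "runtime"),
    ("compiler-phase slowdown", {"compile_ratio_upper": 1.30}, "compile-latency"),
    ("confidence overlap", {"compile_ratio_upper": 1.00}, "compile-latency"),
    ("RSS inflation", {"peak_rss_ratio": 1.50}, "rss"),
    ("checksum mismatch", {"checksum_match": False}, "checksum"),
    ("stale fingerprint", {"fingerprint": "old"}, "stale-fingerprint"),
]


def run_pins():
    """Return the names of pins whose injected fault was NOT reported correctly."""
    assert classify(BASELINE) == set(), "baseline must pass before injecting"
    silent = []
    for name, fault, expected in PINS:
        row = copy.deepcopy(BASELINE)
        row.update(fault)
        # Require exactly the expected blocker: no misses, no collateral noise.
        if classify(row) != {expected}:
            silent.append(name)
    return silent


print(run_pins())  # an empty list means every pin produced its blocker
```

Checking for exactly one blocker per pin, rather than "at least one", is a deliberate choice in this sketch: it catches a checker that reports the wrong reason as well as one that reports nothing.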

Work Items

Paired Pareto-ledger infra (schema, raw samples, regeneration scripts, checksum fairness, confidence bounds, /calc-verified runtime/compile-time/RSS ratios) over the benchmark + rosetta corpus.
Native-binary profiling/disassembly diagnostic tooling (extend diagnostics/ per tooling-first; no one-off scripts).
Gap-closing ratchet rounds: profile-classify-fix-remeasure until every row simultaneously holds runtime score >=1.0, strict backend/end-to-end compile-latency wins vs LLVM -O3, peak RSS <= LLVM -O3, and no unclassified budget event; s19/s17B amendments recorded per round.
GCC -O3 stretch columns where toolchains overlap (informational).
CI Pareto-regression gate locks achieved runtime/compile-time/RSS classifications and rejects stale roster/budget/key/corpus baselines; ledger feeds s23 promotion.
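
One way to picture the stale-baseline rejection in the CI gate is a fingerprint over the inputs whose change forces regeneration, compared against the fingerprint stored with the ledger. The sketch below assumes a flat config dictionary; the key names, the sample values, and the choice of SHA-256 over canonical JSON are illustrative assumptions, not the project's cache-key or fingerprint format.

```python
import hashlib
import json

# Inputs whose change forces full paired regeneration (hypothetical key names).
FINGERPRINT_KEYS = (
    "optimizer_fuel", "growth_limits", "pass_roster", "cache_key_schema",
    "default_jobs", "corpus_fingerprint", "toolchain_fingerprint",
)


def fingerprint(config: dict) -> str:
    """Hash the regeneration-forcing inputs; dict key order does not matter."""
    subset = {k: config[k] for k in FINGERPRINT_KEYS}  # KeyError if one is missing
    blob = json.dumps(subset, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def gate(ledger_fingerprint: str, current_config: dict) -> str:
    """Return 'ok' only if the ledger was regenerated under the current inputs."""
    if ledger_fingerprint != fingerprint(current_config):
        return "stale-ledger: full paired regeneration required"
    return "ok"


# Illustrative configuration; none of these values come from the project.
config = {
    "optimizer_fuel": 10000,
    "growth_limits": {"ir_nodes": 4.0},
    "pass_roster": ["inline", "gvn", "licm"],  # list order is part of the hash
    "cache_key_schema": 3,
    "default_jobs": 8,
    "corpus_fingerprint": "corpus-v1",
    "toolchain_fingerprint": "llvm-x.y",
}
recorded = fingerprint(config)
print(gate(recorded, config))   # ok

reordered = dict(config, pass_roster=["inline", "licm", "gvn"])
print(gate(recorded, reordered))  # stale-ledger: full paired regeneration required
```

Because the pass roster is hashed as an ordered list, reordering passes without adding or removing any is enough to invalidate the ledger, which matches the roster/order wording of the gate.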