0%

s20 — Output-Quality Gates: The Beat-O3 Ledger

Goal

The mission’s core empirical claim becomes a per-program ledger: native optimizing-tier output >= LLVM -O3 on the rosetta + benchmark corpus, program by program, with iterative gap-closing rounds feeding new mid-end work — plus GCC -O3 stretch comparisons.

Implementation Sketch

  • Promotion-ledger infrastructure: per-program rows {program, native runtime, LLVM -O3 runtime, ratio, checksum-match, status} (rosetta benchmark_results.json + cow baseline.json precedents); checksum fairness per the rosetta discipline (identical observable output before timing counts); medians-of-N, sequential runs, /calc for every ratio; regeneration scripts (no hand-rolled numbers).
  • Corpus: tests/benchmarks (incl. cow/, aims_burden/) + the rosetta implemented set (grows as rosetta-stress-test progresses; the ledger consumes whatever exists per round) + the Rust reference twins where present.
  • Gap-closing ratchet rounds: rank programs by ratio; profile the worst (perf/disasm tooling — extend diagnostics/ scripts to cover native binaries per tooling-first); classify the gap (missing pass / isel quality / scheduling / regalloc / runtime-call overhead); route fixes — pass work amends s19’s roster, isel/regalloc fixes go to their subsystems via /fix-bug or plan amendment; re-measure; ledger ratchets monotonically.
  • GCC stretch: where a gcc toolchain target overlaps (linux x86-64/aarch64/s390x/riscv64), record GCC -O3 columns for the C-twin programs (rosetta C++ baselines) as stretch evidence — informational columns, not promotion-blocking.
  • CI quality-regression gate: once a program reaches >=1.0, regression below it fails CI (ratchet lock-in).

Test Strategy

  • The ledger IS the test artifact; gates: checksum-mismatch invalidates a row (never timed), stale-baseline pin (corpus SHA binding), regression gate self-test with an injected slowdown.

Work Items

  • Promotion-ledger infra (schema, regeneration scripts, checksum fairness, /calc-verified ratios) over the benchmark + rosetta corpus.
  • Native-binary profiling/disassembly diagnostic tooling (extend diagnostics/ per tooling-first; no one-off scripts).
  • Gap-closing ratchet rounds: profile-classify-fix-remeasure until the corpus ledger holds >=1.0 vs LLVM -O3 per-program (amendments to s19 roster recorded per round).
  • GCC -O3 stretch columns where toolchains overlap (informational).
  • CI quality-regression gate locking achieved ratios; ledger feeds s23 promotion.