Section 12: Verification Dashboard & Regression Tracking
Status: Not Started
Goal: Build IR baseline infrastructure that captures golden LLVM IR for key programs, detects any IR shape change on subsequent builds, and provides --bless mode for intentional updates. Integrate llvm-reduce for automatic test case reduction when IR regressions are found. Track verification trends (pass counts, verification findings, snapshot diff rates) over time via CI artifacts. This is the lowest-priority section — the “dashboard” is deliberately minimal, focusing on actionable regression detection rather than a UI.
Success Criteria:
- IR baselines captured for ≥20 key programs (satisfies mission criterion: “IR regression tracking”)
- --compare mode detects IR shape changes (satisfies mission criterion: “IR shape regression detection”)
- llvm-reduce reduces failing IR (satisfies mission criterion: “automatic test case reduction”)
- Trend data captured in CI artifacts (satisfies mission criterion: “historical comparison”)
Context: This section complements the FileCheck IR assertions from Section 07. FileCheck tests verify specific IR patterns (e.g., “RC inc/dec pair appears for this function”) — they catch pattern violations. IR baselines catch ANY IR change, not just pattern violations. A new optimization that changes the IR shape without violating any CHECK pattern is invisible to FileCheck but caught by baseline comparison. This dual approach — pattern assertions for known-important shapes, full baselines for unexpected changes — provides comprehensive IR regression coverage. The baseline infrastructure follows the pattern already established by scripts/perf-baseline.sh (performance baselines) but applies it to IR shape.
Reference implementations:
- Ori scripts/perf-baseline.sh: Existing baseline capture/compare pattern with --release flag and structured output. The IR baseline script follows the same UX conventions.
- Rust perf.rust-lang.org: Historical tracking of metrics across commits with automatic regression detection and bisection.
- LLVM llvm/tools/llvm-reduce/: Automatic delta-debugging tool that reduces LLVM IR to the minimal reproducing case for a given “interestingness test” script.
Depends on: Section 11 (CI produces the artifacts that baselines track; nightly/weekly jobs provide the execution cadence for baseline comparison).
12.1 IR Baseline Capture and Comparison
File(s): scripts/ir-baseline.sh, tests/baselines/ (golden IR directory)
Create the baseline capture/compare infrastructure. This is the core of the section — everything else builds on it.
- Create scripts/ir-baseline.sh following the perf-baseline.sh pattern:

      # Usage: scripts/ir-baseline.sh [MODE] [OPTIONS]
      #
      # Modes:
      #   --capture        Capture golden IR baselines for all key programs
      #   --compare        Compare current IR against captured baselines
      #   --bless          Update baselines to match current IR (intentional changes)
      #   --list           List all baselined programs and their status
      #
      # Options:
      #   --release        Use release build (default: debug)
      #   --program FILE   Capture/compare only the specified program
      #   --json           Machine-readable output
      #   --verbose        Show full diff on mismatch
      #   --no-color       Disable color output

- Define the key program list (tests/baselines/programs.txt). Select ≥20 programs that collectively exercise the major codegen patterns:

      # Format: <ori_file_path> <description>
      tests/spec/types/int/basic.ori                  # Integer arithmetic
      tests/spec/types/str/concat.ori                 # String operations + RC
      tests/spec/types/list/map_filter.ori            # Collection operations
      tests/spec/traits/iterator/basic.ori            # Iterator protocol
      tests/spec/traits/iterator/chain.ori            # Iterator chaining
      tests/spec/collections/cow/basic_cow.ori        # COW patterns
      tests/spec/types/closures/capture.ori           # Closure codegen
      tests/spec/types/enum/match_exhaustive.ori      # Enum dispatch
      tests/spec/types/struct/nested.ori              # Struct layout + fields
      tests/spec/types/option/map_unwrap.ori          # Option handling
      tests/spec/types/result/error_propagation.ori   # Result + ? operator
      tests/spec/traits/derive/eq_comparable.ori      # Derived trait codegen
      tests/spec/functions/recursive.ori              # Recursion + tail calls
      tests/spec/functions/higher_order.ori           # Higher-order functions
      tests/spec/patterns/nested_match.ori            # Complex pattern matching
      tests/spec/types/tuple/access.ori               # Tuple codegen
      tests/spec/types/map/basic.ori                  # Map operations
      tests/spec/types/set/basic.ori                  # Set operations
      tests/spec/loops/for_yield.ori                  # For-yield comprehensions
      tests/spec/loops/while_break.ori                # While + break

  The exact file paths will be determined at implementation time based on what exists. The criteria are: each program exercises a distinct codegen path, and together they cover RC emission, COW, closures, iterators, pattern matching, and control flow.
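The list format above (a path followed by a `#` description) can be read with a small filter. The following is a minimal sketch, assuming descriptions always start at the first `#` on a line and that paths never contain `#`; the function name `list_programs` is hypothetical and not part of the plan.

```shell
# Hypothetical helper: print one program path per line from a programs.txt-style file.
list_programs() {
  # 1. drop everything from the first '#' (header line and descriptions)
  # 2. trim leading and trailing whitespace
  # 3. delete lines that are now empty
  sed -e 's/#.*$//' \
      -e 's/^[[:space:]]*//' \
      -e 's/[[:space:]]*$//' \
      -e '/^$/d' "$1"
}
```

A capture or compare loop could then iterate with `list_programs tests/baselines/programs.txt | while read -r program; do ...; done`.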
- Implement baseline capture (--capture):
  - For each program in the list, compile with ORI_DUMP_AFTER_LLVM=1 cargo run -- build <file>
  - Capture the LLVM IR from stderr
  - Normalize the IR (see normalization rules below)
  - Write to tests/baselines/<program_name>.baseline.ll
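Two details of the capture steps are easy to get wrong: capturing stderr without stdout, and deriving a unique `<program_name>`. The sketch below is illustrative only. Both function names are hypothetical, and flattening the full path into the baseline name is an assumption made here because several listed programs share the file name basic.ori; the plan itself does not define how `<program_name>` is derived.

```shell
# Hypothetical helper: map a program path to a collision-free baseline file name.
# e.g. tests/spec/types/int/basic.ori -> tests__spec__types__int__basic.baseline.ll
baseline_name() {
  printf '%s\n' "$1" | sed -e 's/\.ori$//' -e 's|/|__|g' -e 's/$/.baseline.ll/'
}

# Hypothetical helper: run a command and emit only what it wrote to stderr.
# Redirection order matters: stderr is first pointed at the current stdout,
# then stdout is discarded.
capture_stderr() {
  "$@" 2>&1 >/dev/null
}
```

With these, one capture step might look like `capture_stderr env ORI_DUMP_AFTER_LLVM=1 cargo run -- build "$program"`, with the result normalized and written to `tests/baselines/$(baseline_name "$program")`.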
- Implement IR normalization for stable comparison. Raw LLVM IR contains elements that change across builds without semantic significance:
  - Metadata IDs: !0, !1, … (strip or renumber sequentially)
  - Unnamed temporaries: %0, %1, … (already sequential in LLVM output, keep as-is)
  - Module-level metadata: source_filename, target datalayout, target triple (strip; machine-specific)
  - Debug info: !dbg !N references (strip entirely; debug info is orthogonal to codegen correctness)
  - Alignment attributes: Keep (alignment changes are semantically significant)
  - Function attributes: Keep attribute groups but strip group numbers (#0, #1) and renumber
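The normalization rules above can be sketched as a single awk filter. This is a rough illustration under stated assumptions, not the planned implementation: the function name `normalize_ir` is hypothetical, renumbering is done in order of first appearance, and the patterns assume that `#N` and `!N` do not occur inside string constants.

```shell
# Hypothetical sketch: read LLVM IR on stdin, write normalized IR on stdout.
normalize_ir() {
  awk '
    # Renumber tokens such as #3 or !7 in order of first appearance.
    function renumber(s, sigil,    out, tok) {
      out = ""
      while (match(s, sigil "[0-9]+")) {
        tok = substr(s, RSTART, RLENGTH)
        if (!(tok in map)) { map[tok] = sigil "" (count[sigil] + 0); count[sigil]++ }
        out = out substr(s, 1, RSTART - 1) map[tok]
        s = substr(s, RSTART + RLENGTH)
      }
      return out s
    }
    # Machine-specific module-level lines: drop.
    /^source_filename = / || /^target datalayout = / || /^target triple = / { next }
    # Debug-info metadata nodes and the named node that roots them: drop.
    /^![0-9]+ = (distinct )?!DI/ || /^!llvm\.dbg\.cu = / { next }
    {
      line = $0
      gsub(/,? !dbg ![0-9]+/, "", line)   # strip debug-location attachments
      line = renumber(line, "#")          # attribute group numbers
      line = renumber(line, "!")          # remaining metadata IDs
      print line
    }
  '
}
```

Alignment attributes and unnamed temporaries pass through untouched, matching the "keep" rules in the list.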
- Implement baseline comparison (--compare):
  - For each program, capture current IR (same as --capture)
  - Load the golden baseline
  - Diff the normalized IR
  - Report mismatches with unified diff output
  - Exit code: 0 = all match, 1 = mismatches found, 2 = error
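The exit-code contract above lines up with the exit status of `diff` itself (0 for identical, 1 for different, greater than 1 for trouble), which keeps the comparison loop short. The sketch below assumes the current normalized IR has already been written to a scratch directory under the same file names as the baselines; that layout and the name `compare_baselines` are hypothetical.

```shell
# Hypothetical sketch: compare every golden baseline against freshly captured IR.
# Returns 0 if all match, 1 if any differ, 2 on error.
compare_baselines() {
  local baseline_dir="$1" current_dir="$2" status=0 golden current
  for golden in "$baseline_dir"/*.baseline.ll; do
    [ -e "$golden" ] || continue            # glob matched nothing: no baselines yet
    current="$current_dir/$(basename "$golden")"
    if [ ! -f "$current" ]; then
      echo "ERROR: no current IR for $golden" >&2
      return 2
    fi
    diff -u "$golden" "$current"            # unified diff is printed on mismatch
    case $? in
      0) ;;                                 # identical
      1) status=1 ;;                        # mismatch: keep going, report all
      *) return 2 ;;                        # diff itself failed
    esac
  done
  return "$status"
}
```

Continuing after the first mismatch is a deliberate choice in this sketch, so that one run reports every changed baseline.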
- Implement bless mode (--bless):
  - Capture current IR for all programs (or --program target)
  - Overwrite the golden baselines
  - Report which baselines were updated
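To report which baselines were updated, bless mode needs to distinguish changed files from unchanged ones. One possible approach, sketched here with a hypothetical `bless_one` helper, is to compare bytes with `cmp -s` before copying, so that unchanged baselines are neither rewritten nor reported.

```shell
# Hypothetical sketch: update one golden baseline from freshly captured IR.
# Prints "updated: <path>" only when the golden file was created or changed.
bless_one() {
  local current="$1" golden="$2"
  if [ -f "$golden" ] && cmp -s "$current" "$golden"; then
    return 0                                # identical: leave the baseline alone
  fi
  cp "$current" "$golden" && echo "updated: $golden"
}
```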
- Subsection close-out (12.1), MANDATORY before starting 12.2:
  - All tasks above are [x] and the subsection’s behavior is verified
  - Update this subsection’s status in section frontmatter to complete
  - Run /improve-tooling retrospectively on THIS subsection: reflect on the debugging journey for 12.1 specifically (IR normalization edge cases, baseline path management, diff readability). Implement every accepted improvement NOW and commit each via SEPARATE /commit-push using a valid conventional-commit type (build(scripts): ...).
  - Run /sync-claude on THIS subsection: check whether code changes invalidated any CLAUDE.md, .claude/rules/*.md, or canon.md claims. If no API/command/phase changes, document briefly. Fix any drift NOW.
  - Repo hygiene check: run diagnostics/repo-hygiene.sh --check and clean any detected temp files.
12.2 llvm-reduce Integration
File(s): diagnostics/reduce-ir.sh, diagnostics/reduce-tests/ (interestingness test scripts)
Integrate LLVM’s llvm-reduce tool for automatic test case reduction. When an IR regression is found (via baseline comparison, FileCheck failure, or manual investigation), llvm-reduce shrinks the failing LLVM IR to the minimum reproducing case — saving hours of manual reduction.
- Create diagnostics/reduce-ir.sh following diagnostic conventions:

      # Usage: diagnostics/reduce-ir.sh <file.ll> --test <interestingness_test> [OPTIONS]
      #
      # Reduces LLVM IR to the minimal reproducing case using llvm-reduce.
      #
      # Options:
      #   --test SCRIPT    Interestingness test script (REQUIRED)
      #   --output FILE    Output path for reduced IR (default: <input>.reduced.ll)
      #   --timeout SECS   Per-test timeout (default: 30)
      #   --verbose        Show llvm-reduce progress
      #
      # Built-in interestingness tests:
      #   --crash          Reduce to minimal crash reproducer
      #   --wrong-output   Reduce to minimal wrong-output case (requires --expected)
      #   --leak           Reduce to minimal leak case

- Create interestingness test templates in diagnostics/reduce-tests/:
  - crash-test.sh: passes if the input IR crashes the compiler/runtime
  - wrong-output-test.sh: passes if the compiled IR produces output different from expected
  - leak-test.sh: passes if ORI_CHECK_LEAKS=1 reports leaks
  - pattern-test.sh: passes if a specific IR pattern (grep) is present/absent
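For context on what "passes" means in these templates: llvm-reduce runs the interestingness test on each candidate file and keeps a reduction only when the test exits with status 0. A pattern test can therefore be little more than a grep. The sketch below shows the "pattern is present" case; the function name and argument order are hypothetical, and a real template would still need to decide how the pattern is passed in.

```shell
# Hypothetical sketch of the core of pattern-test.sh.
# Exit status 0 means "still interesting": the pattern is present in the candidate IR.
pattern_is_interesting() {
  local pattern="$1" ir_file="$2"
  grep -q -e "$pattern" -- "$ir_file"
}
```

The "absent" variant would invert the result, for example `! pattern_is_interesting "$pattern" "$ir_file"`.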
- Verify llvm-reduce is available on the system (it ships with LLVM 21):

      if ! command -v llvm-reduce-21 &>/dev/null && ! command -v llvm-reduce &>/dev/null; then
          echo "ERROR: llvm-reduce not found. Install LLVM 21 tools."
          exit 2
      fi

- Add the script to diagnostics/self-test.sh with a synthetic test case (a known-reducible IR file where the interestingness test checks for a specific pattern).
- TPR checkpoint: /tpr-review covering 12.1–12.2 implementation work
- Subsection close-out (12.2), MANDATORY before starting 12.3:
  - All tasks above are [x] and the subsection’s behavior is verified
  - Update this subsection’s status in section frontmatter to complete
  - Run /improve-tooling retrospectively on THIS subsection: same protocol as 12.1’s close-out, scoped to 12.2’s debugging journey. Commit improvements separately using a valid conventional-commit type.
  - Run /sync-claude on THIS subsection: check whether code changes invalidated any CLAUDE.md, .claude/rules/*.md, or canon.md claims. If no API/command/phase changes, document briefly. Fix any drift NOW.
  - Repo hygiene check: run diagnostics/repo-hygiene.sh --check and clean any detected temp files.
12.3 Trend Tracking and CI Artifacts
File(s): scripts/verification-summary.sh, .github/workflows/nightly-verification.yml (artifact upload)
Capture verification metrics over time so that trends (increasing false positives, decreasing pass counts, growing IR size) are visible. This is the minimal “dashboard” — a structured JSON summary uploaded as a CI artifact, not a web UI.
- Create scripts/verification-summary.sh that aggregates metrics from all verification tools:

      # Usage: scripts/verification-summary.sh [--json] [--verbose]
      #
      # Collects metrics from:
      #   - test-all.sh results (pass/fail/skip counts)
      #   - AIMS snapshot diffs (changed/unchanged counts)
      #   - FileCheck test results (pass/fail counts)
      #   - IR baseline comparison (matched/changed/missing counts)
      #   - Alive2 results (verified/timeout/suppressed/failed counts)
      #   - Fuzzing metrics (corpus size, executions/sec, crashes found)
      #
      # Output: JSON summary suitable for historical comparison

- Define the summary JSON format:

      {
        "timestamp": "2026-04-10T02:00:00Z",
        "commit": "abc123",
        "metrics": {
          "tests": { "rust_pass": 1234, "rust_fail": 0, "ori_pass": 1677, "ori_fail": 0 },
          "aims_snapshots": { "matched": 15, "changed": 0, "missing": 0 },
          "filecheck": { "pass": 30, "fail": 0 },
          "ir_baselines": { "matched": 20, "changed": 0, "missing": 0 },
          "alive2": { "verified": 15, "timeout": 2, "suppressed": 3, "failed": 0 },
          "fuzzing": { "corpus_size": 5000, "execs_per_sec": 150, "crashes": 0 }
        }
      }

- Add summary capture to the nightly CI workflow:

      - name: Capture verification summary
        run: scripts/verification-summary.sh --json > verification-summary.json
      - uses: actions/upload-artifact@v4
        with:
          name: verification-summary-${{ github.sha }}
          path: verification-summary.json
          retention-days: 365  # Keep for trend analysis

- Add IR baseline comparison to the nightly CI:

      - name: IR baseline comparison
        run: scripts/ir-baseline.sh --compare --json > ir-baseline-results.json
      - uses: actions/upload-artifact@v4
        with:
          name: ir-baselines-${{ github.sha }}
          path: ir-baseline-results.json

- Add a simple trend detection check to the nightly job. Compare the current summary against the previous day’s summary (downloaded from CI artifacts):
  - If any metric worsened (e.g., ir_baselines.changed > 0 when yesterday it was 0), add a warning annotation to the workflow run
  - If a critical metric crossed a threshold (e.g., alive2.failed > 0), fail the job
  - This is NOT a full dashboard; it is a regression detector
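The two trend rules above reduce to integer comparisons once the metric values have been extracted from the two summaries. The sketch below leaves the extraction step out and only shows the decision logic, using GitHub Actions workflow commands (`::warning` and `::error`) for the annotations. Both function names are hypothetical, and the comparison assumes count-style metrics where a higher value is worse (such as changed or failed); pass counts would need the opposite comparison.

```shell
# Hypothetical sketch: warn when a "higher is worse" metric rose since the previous run.
warn_if_worse() {
  local name="$1" prev="$2" curr="$3"
  if [ "$curr" -gt "$prev" ]; then
    echo "::warning title=Verification trend::$name rose from $prev to $curr"
  fi
}

# Hypothetical sketch: fail (non-zero status) when a critical metric is above zero.
fail_if_nonzero() {
  local name="$1" value="$2"
  if [ "$value" -gt 0 ]; then
    echo "::error title=Verification failure::$name is $value"
    return 1
  fi
}
```

A nightly step might call `warn_if_worse ir_baselines.changed "$prev_changed" "$curr_changed"` followed by `fail_if_nonzero alive2.failed "$curr_failed"`, where the variables are placeholders for values read from the two JSON summaries.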
- Subsection close-out (12.3), MANDATORY before starting 12.R:
  - All tasks above are [x] and the subsection’s behavior is verified
  - Update this subsection’s status in section frontmatter to complete
  - Run /improve-tooling retrospectively on THIS subsection.
  - Run /sync-claude on THIS subsection: check whether code changes invalidated any CLAUDE.md, .claude/rules/*.md, or canon.md claims. If no API/command/phase changes, document briefly. Fix any drift NOW.
  - Repo hygiene check: run diagnostics/repo-hygiene.sh --check and clean any detected temp files.
12.R Third Party Review Findings
- None.
12.N Completion Checklist
- scripts/ir-baseline.sh exists with --capture, --compare, --bless, --list modes
- IR normalization handles metadata, debug info, module-level metadata, attribute groups
- Golden IR baselines captured for ≥20 key programs in tests/baselines/
- --compare detects IR shape changes with clear unified diff output
- --bless updates baselines when changes are intentional
- diagnostics/reduce-ir.sh wraps llvm-reduce with built-in interestingness tests
- Interestingness test templates for crash, wrong-output, leak, and pattern modes
- scripts/verification-summary.sh aggregates metrics from all verification tools
- Summary JSON uploaded as nightly CI artifact with 365-day retention
- IR baseline comparison runs in nightly CI
- Trend detection warns on metric regression
- Both new scripts added to diagnostics/self-test.sh
- No existing tests regressed: timeout 150 ./test-all.sh green
- timeout 150 ./clippy-all.sh green
- Plan annotation cleanup: bash .claude/skills/impl-hygiene-review/plan-annotations.sh --plan 12 returns 0 annotations
- All intermediate TPR checkpoint findings resolved
- Plan sync: update plan metadata to reflect this section’s completion:
  - This section’s frontmatter status → complete, subsection statuses updated
  - 00-overview.md Quick Reference table status updated for this section
  - 00-overview.md mission success criteria checkboxes updated
  - index.md section status updated
- /tpr-review passed (final, full-section)
- /impl-hygiene-review passed, AFTER /tpr-review is clean
- /improve-tooling section-close sweep: verify per-subsection retrospectives ran, add cross-cutting items.
Exit Criteria: scripts/ir-baseline.sh --compare runs against ≥20 golden IR baselines with zero unexpected changes. --bless updates baselines cleanly. diagnostics/reduce-ir.sh reduces failing IR to minimal reproducers using llvm-reduce with built-in interestingness tests. scripts/verification-summary.sh produces a structured JSON summary aggregating all verification metrics. Nightly CI uploads baselines and summaries as artifacts with 365-day retention. Simple trend detection warns on metric regressions. All scripts follow diagnostic conventions and are registered in diagnostics/self-test.sh.