Section 11: CI Integration & ARC IR Parity
Status: Not Started
Goal: Consolidate all verification tools from Sections 01-10 into a coherent CI pipeline with three tiers: every-commit (fast, high-signal), nightly (medium, comprehensive), and weekly (slow, exhaustive). Add ARC IR debug-vs-release parity checking (extending the existing debug-release-compare.sh) and an opt-bisect diagnostic script. This section does NOT create CI gates from scratch — per Codex feedback, each earlier section should already add its own CI gate. Section 11 verifies completeness, fills gaps, and ensures the tiered execution model is coherent.
CRITICAL BLOCKER: The LLVM crash escape hatch in test-all.sh (which masks LLVM backend crashes) is owned by plans/llvm-worker-isolation/. This section does NOT remove that escape hatch. If the escape hatch is still present when Section 11 starts, document it as a known limitation and add a <!-- blocked-by: llvm-worker-isolation --> annotation.
Success Criteria:
- Three CI tiers operational — satisfies mission criterion: “CI fully integrated”
- ARC IR parity catches structural divergences — satisfies mission criterion: “ARC IR debug-vs-release parity”
- opt-bisect script identifies failing LLVM passes — satisfies mission criterion: “diagnostic tooling”
- No verification tool exists only locally — satisfies mission criterion: “CI runs all verification”
Context: The current CI (.github/workflows/ci.yml) runs three test suites: Rust workspace tests (cargo test --workspace), Ori spec tests (interpreter only, via cargo run -p oric -- test tests/), and runtime tests (cargo test -p ori_rt). It is MISSING: LLVM backend spec tests (ori test --backend=llvm tests/), FileCheck IR tests, sanitizer instrumentation, AIMS snapshot verification, Alive2 refinement checking, and differential fuzzing. Each of Sections 01-10 adds its own CI gate incrementally; this section verifies that all gates are present and organizes them into the tiered execution model described in the research document.
Reference implementations:
- Rust .github/workflows/: Separate workflows for PR CI (fast), scheduled CI (nightly with extra sanitizer jobs), and weekly jobs (fuzzing, extensive testing).
- Swift utils/build-script: Tiered build modes with --validation-test (nightly) and --stress-test (weekly).
- LLVM .github/workflows/: Separate sanitizer, coverage, and fuzz jobs on different schedules.
Depends on: All previous sections (01-10). This is the integration section.
11.1 Every-Commit CI Tier
File(s): .github/workflows/ci.yml
The every-commit tier runs on every PR. It must be fast (add at most 3 minutes to current CI) and high-signal (catch the most common regressions). These gates should already exist from Sections 01-08 — this subsection audits and fills gaps.
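The 3-minute budget can be checked mechanically once job durations are known. A minimal sketch, assuming durations are available as whole seconds; the helper name within_budget is hypothetical and not part of the existing tooling:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the job grew by at most the allowed increase.
# Arguments: baseline duration, current duration, allowed increase (whole seconds).
within_budget() {
    local baseline="$1" current="$2" allowed="$3"
    local added=$(( current - baseline ))
    echo "added=${added}s allowed=${allowed}s"
    [ "$added" -le "$allowed" ]
}
```

For example, within_budget 600 720 180 succeeds because the job grew by 120 seconds against a 180-second allowance.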
- Audit existing CI for gates that should already be present from earlier sections:
  - §01: ORI_VERIFY_EACH=1 and ORI_VERIFY_ARC=1 in env block
  - §01: Function-level fn_val.verify() (implicit; runs during cargo test --workspace)
  - §07: FileCheck tests in compiler/ori_llvm/tests/codegen/ (via cargo test --workspace if integrated as Rust tests, or explicit ori test --backend=llvm compiler/ori_llvm/tests/codegen/)
  - §08: Sanitizer smoke (if §08 added a smoke job; check)
  - MISSING (known): ori test --backend=llvm tests/ (LLVM backend spec tests)
- Ensure the --emit build path runs CaptureHooks for IR capture (deferred from [TPR-09-026-codex]; currently --emit bypasses alive2 capture, which means developer-facing builds can’t be verified).
- Add LLVM backend spec tests to the every-commit CI if not already present:

    - name: Ori LLVM backend tests
      run: |
        set -o pipefail
        cargo run -p oric --bin ori -- test --backend=llvm tests/ 2>&1 | tee llvm-test-output.txt
        LLVM_TESTS=$(grep -oE '[0-9]+ passed' llvm-test-output.txt | grep -oE '^[0-9]+' | tail -1 || echo "0")
        LLVM_FAILED=$(grep -oE '[0-9]+ failed' llvm-test-output.txt | grep -oE '^[0-9]+' | tail -1 || echo "0")
        echo "LLVM_TESTS=$LLVM_TESTS" >> "$GITHUB_ENV"
        echo "LLVM_FAILED=$LLVM_FAILED" >> "$GITHUB_ENV"
      env:
        LLVM_SYS_211_PREFIX: /usr/lib/llvm-21
        ORI_VERIFY_EACH: "1"
        ORI_VERIFY_ARC: "1"

- Verify the total CI time stays within the performance budget. The current test job timeout is 30 minutes. The every-commit additions should add at most 3 minutes:
  - ORI_VERIFY_EACH=1 adds ~30-60% to LLVM test time (from research)
  - FileCheck tests: fast (simple pattern matching)
  - LLVM backend spec tests: medium (AOT compile + execute per test)
  - If the total exceeds 30 minutes, parallelize by splitting into multiple jobs
- Add the verification env vars to ALL test steps in the test job, not just the LLVM-specific one:

    env:
      ORI_VERIFY_EACH: "1"
      ORI_VERIFY_ARC: "1"
      LLVM_SYS_211_PREFIX: /usr/lib/llvm-21

- Update the test results summary to include LLVM backend test counts in test-results.json.
- Subsection close-out (11.1), MANDATORY before starting 11.2:
  - All tasks above are [x] and the subsection’s behavior is verified
  - Update this subsection’s status in section frontmatter to complete
  - Run /improve-tooling retrospectively on THIS subsection: reflect on the debugging journey for 11.1 specifically (CI configuration debugging, workflow syntax issues, timing analysis). Implement every accepted improvement NOW and commit each via SEPARATE /commit-push using a valid conventional-commit type (ci: ...).
  - Run /sync-claude on THIS subsection: check whether code changes invalidated any CLAUDE.md, .claude/rules/*.md, or canon.md claims. If no API/command/phase changes, document briefly. Fix any drift NOW.
  - Repo hygiene check: run diagnostics/repo-hygiene.sh --check and clean any detected temp files.
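The gate audit in 11.1 can be partly mechanized. A minimal sketch, assuming each gate is recognizable as a literal string in the workflow file; the function name audit_gates and the choice of patterns are illustrative and not part of the existing tooling:

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which expected gate strings are absent from a
# workflow file. Prints one "MISSING: <pattern>" line per absent gate and
# returns 1 if any gate is missing.
#
# Example call (patterns taken from the audit list above):
#   audit_gates .github/workflows/ci.yml ORI_VERIFY_EACH ORI_VERIFY_ARC 'test --backend=llvm tests/'
audit_gates() {
    local workflow="$1"; shift
    local missing=0 pattern
    for pattern in "$@"; do
        # -F: fixed string, -q: quiet, --: patterns may start with '-'
        if ! grep -Fq -- "$pattern" "$workflow"; then
            echo "MISSING: $pattern"
            missing=1
        fi
    done
    return "$missing"
}
```

A literal-string check only proves that a gate is mentioned, not that it runs, so it complements rather than replaces reading the workflow.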
11.2 Nightly CI Tier
File(s): .github/workflows/nightly-verification.yml (new workflow, or extend existing nightly.yml)
The nightly tier runs on a schedule (e.g., 2:00 AM UTC). It runs more expensive verification that is too slow for every-commit CI but should catch issues before they accumulate.
- Create .github/workflows/nightly-verification.yml (separate from the existing nightly.yml, which handles release PRs):

    name: Nightly Verification
    on:
      schedule:
        - cron: '0 2 * * *'  # 2:00 AM UTC daily
      workflow_dispatch:      # Manual trigger
    jobs:
      sanitizers:
        name: Sanitizer Suite
        runs-on: ubuntu-latest
        timeout-minutes: 45
        steps:
          # ... LLVM + Rust setup
          - name: ASan/UBSan smoke
            run: # ... from §08's CI gate
            env:
              ORI_SANITIZE: "address,undefined"
      alive2:
        name: Alive2 Curated Corpus
        runs-on: ubuntu-latest
        timeout-minutes: 30
        steps:
          # ... from §09.5's nightly job
      aims-snapshots:
        name: AIMS Snapshot Verification
        runs-on: ubuntu-latest
        timeout-minutes: 15
        steps:
          # ... build + cargo test -p oric --test aims_snapshots
      arc-parity:
        name: ARC IR Parity (Debug vs Release)
        runs-on: ubuntu-latest
        timeout-minutes: 20
        steps:
          # ... from §11.4 below

- Harden Alive2 suppression workflow (deferred from Section 09 TPR):
  - Suppressions must run alive-tv with suppression-aware checking so stale suppressions are detected when the underlying false positive is resolved ([TPR-09-022-codex])
  - Normalize artifact paths to repo-relative in both ir_capture.rs and alive2-verify.sh to eliminate absolute vs relative path inconsistency ([TPR-09-023-codex])
  - Add suppression-stale status to tests/alive2/results-schema.json so the schema can represent stale suppressions ([TPR-09-024-codex], depends on TPR-09-022)
  - scripts/build-alive2.sh --cached must verify the cached binary matches the pinned commit hash, not just that a cached binary exists ([TPR-09-025-codex])
  - Z3 preflight detection in scripts/build-alive2.sh must check for dev headers/libs, not just CLI presence; cmake FindZ3 handles build-time discovery, but preflight should warn early ([TPR-09-027-codex])
- Audit which nightly gates should already exist from earlier sections:
  - §03: AIMS snapshot tests (cargo test -p oric --test aims_snapshots)
  - §08: Full sanitizer suite (ASan/UBSan on AOT smoke subset)
  - §09: Alive2 curated corpus (diagnostics/alive2-verify.sh --corpus)
- Add failure notification. Nightly failures must notify (unlike weekly, which is informational):

    notify:
      name: Notify on Failure
      needs: [sanitizers, alive2, aims-snapshots, arc-parity]
      if: failure()
      runs-on: ubuntu-latest
      permissions:
        issues: write  # lets the default GITHUB_TOKEN create issues when the repo default is read-only
      steps:
        - name: Create issue
          uses: actions/github-script@v7
          with:
            script: |
              // await so that an API error fails the step instead of being dropped
              await github.rest.issues.create({
                owner: context.repo.owner,
                repo: context.repo.repo,
                title: `Nightly verification failed: ${new Date().toISOString().slice(0,10)}`,
                body: `Nightly verification pipeline failed. Check workflow run.`,
                labels: ['nightly-failure', 'verification']
              })

- TPR checkpoint: /tpr-review covering 11.1–11.2 implementation work
- Subsection close-out (11.2), MANDATORY before starting 11.3:
  - All tasks above are [x] and the subsection’s behavior is verified
  - Update this subsection’s status in section frontmatter to complete
  - Run /improve-tooling retrospectively on THIS subsection: same protocol as 11.1’s close-out, scoped to 11.2’s debugging journey. Commit improvements separately using a valid conventional-commit type.
  - Run /sync-claude on THIS subsection: check whether code changes invalidated any CLAUDE.md, .claude/rules/*.md, or canon.md claims. If no API/command/phase changes, document briefly. Fix any drift NOW.
  - Repo hygiene check: run diagnostics/repo-hygiene.sh --check and clean any detected temp files.
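The cached-binary requirement from [TPR-09-025-codex] in 11.2 amounts to comparing a recorded commit against the pinned one. A minimal sketch, assuming the cache directory holds the alive-tv binary next to a COMMIT stamp file written at build time; the stamp file, its name, and the function name cache_is_valid are all assumptions, not the actual layout used by scripts/build-alive2.sh:

```shell
#!/usr/bin/env bash
# Hypothetical cache layout: <cache_dir>/alive-tv plus <cache_dir>/COMMIT,
# where COMMIT contains the commit hash the binary was built from.
cache_is_valid() {
    local cache_dir="$1" pinned_commit="$2"
    local binary="$cache_dir/alive-tv" stamp="$cache_dir/COMMIT"
    # A binary without a stamp cannot be trusted, so treat it as a cache miss.
    [ -x "$binary" ] && [ -f "$stamp" ] || return 1
    local cached_commit
    cached_commit=$(tr -d '[:space:]' < "$stamp")
    [ "$cached_commit" = "$pinned_commit" ]
}
```

Treating a missing stamp as a miss errs toward rebuilding, which costs time but never reuses a binary of unknown origin.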
11.3 Weekly CI Tier
File(s): .github/workflows/weekly-verification.yml (new workflow)
The weekly tier runs expensive, exhaustive verification: full fuzzing campaigns, complete Alive2 sweeps, and full sanitizer matrices. Results are informational — failures create tracking issues but do not block development.
- Create .github/workflows/weekly-verification.yml:

    name: Weekly Verification
    on:
      schedule:
        - cron: '0 4 * * 0'  # 4:00 AM UTC every Sunday
      workflow_dispatch:
    jobs:
      fuzz:
        name: Differential Fuzzing
        runs-on: ubuntu-latest
        timeout-minutes: 240  # 4 hours
        steps:
          # ... from §10.5's weekly job
      alive2-full:
        name: Alive2 Full Sweep
        runs-on: ubuntu-latest
        timeout-minutes: 120
        steps:
          # ... from §09.5's weekly job
      sanitizer-matrix:
        name: Full Sanitizer Matrix
        runs-on: ubuntu-latest
        timeout-minutes: 60
        strategy:
          matrix:
            sanitizer: [address, undefined]
            # MSan requires full-program instrumentation; separate job
        steps:
          # ... per-sanitizer full sweep from §08

- Upload all results as CI artifacts for historical comparison (consumed by §12):

    - uses: actions/upload-artifact@v4
      with:
        name: weekly-verification-${{ github.run_id }}
        path: |
          build/alive2-results/
          fuzz/artifacts/
          build/sanitizer-results/
        retention-days: 90

- Weekly failures create tracking issues but do NOT block merges:

    if: failure()
    steps:
      - name: Create tracking issue
        uses: actions/github-script@v7
        with:
          script: |
            // await so that an API error fails the step instead of being dropped
            await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: `Weekly verification: findings on ${new Date().toISOString().slice(0,10)}`,
              body: `Weekly verification found issues. Triage required.`,
              labels: ['weekly-verification', 'triage-needed']
            })

- Subsection close-out (11.3), MANDATORY before starting 11.4:
  - All tasks above are [x] and the subsection’s behavior is verified
  - Update this subsection’s status in section frontmatter to complete
  - Run /improve-tooling retrospectively on THIS subsection.
  - Run /sync-claude on THIS subsection: check whether code changes invalidated any CLAUDE.md, .claude/rules/*.md, or canon.md claims. If no API/command/phase changes, document briefly. Fix any drift NOW.
  - Repo hygiene check: run diagnostics/repo-hygiene.sh --check and clean any detected temp files.
11.4 ARC IR Debug-vs-Release Parity
File(s): diagnostics/debug-release-compare.sh
Extend the existing debug-release-compare.sh to capture and compare ARC IR between debug and release builds. Currently the script compares behavioral output (exit codes + stdout) and LLVM IR diffs — but not ARC IR. AIMS pipeline divergences between debug and release (due to different optimization flags or analysis precision) can cause subtle behavioral drift masked by LLVM optimization.
- Add an --arc-ir flag to debug-release-compare.sh that enables ARC IR comparison:

    # Note: 2> also captures cargo's own stderr output (build progress),
    # so the dump may need filtering before it is compared.
    # In the debug build step:
    ORI_DUMP_AFTER_ARC=1 cargo run -- build "$file" -o "$debug_binary" 2>"$debug_arc_ir"
    # In the release build step:
    ORI_DUMP_AFTER_ARC=1 cargo run --release -- build "$file" -o "$release_binary" 2>"$release_arc_ir"
    # Compare ARC IR (structural diff, ignoring whitespace and variable IDs)
    diff_arc_ir "$debug_arc_ir" "$release_arc_ir"

- Implement ARC IR diff normalization. ARC IR uses generated variable IDs (v0, v1, …) that may differ between debug and release builds due to different allocation patterns. The diff must normalize:
  - Variable IDs: v<N> → canonical renumbering based on first occurrence
  - Block IDs: bb<N> → canonical renumbering based on first occurrence
  - Whitespace: normalize to single spaces
  - Keep all instructions, RC operations, and control flow intact
- Add ARC IR parity to the nightly CI (§11.2) with a representative subset of test programs:

    # Run on key programs that exercise different ARC patterns:
    PARITY_FAILURES=0
    for f in tests/spec/traits/iterator/*.ori tests/spec/collections/cow/*.ori; do
      diagnostics/debug-release-compare.sh --arc-ir "$f" || PARITY_FAILURES=$((PARITY_FAILURES + 1))
    done
    # Fail the step if any program diverged
    [ "$PARITY_FAILURES" -eq 0 ]

- TPR checkpoint: /tpr-review covering 11.3–11.4 implementation work
- Subsection close-out (11.4), MANDATORY before starting 11.5:
  - All tasks above are [x] and the subsection’s behavior is verified
  - Update this subsection’s status in section frontmatter to complete
  - Run /improve-tooling retrospectively on THIS subsection.
  - Run /sync-claude on THIS subsection: check whether code changes invalidated any CLAUDE.md, .claude/rules/*.md, or canon.md claims. If no API/command/phase changes, document briefly. Fix any drift NOW.
  - Repo hygiene check: run diagnostics/repo-hygiene.sh --check and clean any detected temp files.
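The normalization rules in 11.4 can be sketched with awk. This is an illustration under stated assumptions, not the script's actual implementation: it assumes IDs appear as standalone tokens of the form v<N> and bb<N>, renumbers them globally across the whole dump (not per function), and also strips leading and trailing whitespace. The function name normalize_arc_ir is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical filter: reads an ARC IR dump on stdin, writes the normalized form.
normalize_arc_ir() {
    awk '
    BEGIN { nv = 0; nb = 0 }
    # Map a token to its canonical name; non-ID tokens pass through unchanged.
    function canon(tok) {
        if (tok ~ /^v[0-9]+$/) {
            if (!(tok in vmap)) { vmap[tok] = "v" nv; nv++ }
            return vmap[tok]
        }
        if (tok ~ /^bb[0-9]+$/) {
            if (!(tok in bmap)) { bmap[tok] = "bb" nb; nb++ }
            return bmap[tok]
        }
        return tok
    }
    {
        rest = $0; out = ""
        # Walk the line one identifier-like token at a time, so that "v1"
        # inside a longer name such as "rev1" is never rewritten.
        while (match(rest, /[A-Za-z0-9_]+/)) {
            out = out substr(rest, 1, RSTART - 1) canon(substr(rest, RSTART, RLENGTH))
            rest = substr(rest, RSTART + RLENGTH)
        }
        out = out rest
        gsub(/[ \t]+/, " ", out)            # collapse runs of whitespace
        sub(/^ /, "", out); sub(/ $/, "", out)
        print out
    }'
}
```

In bash, a diff_arc_ir helper could then be as small as diff <(normalize_arc_ir < "$1") <(normalize_arc_ir < "$2"). Because renumbering follows first occurrence, two dumps that differ only in ID allocation normalize to identical text, while a reordered or missing instruction still shows up in the diff.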
11.5 opt-bisect Diagnostic Script
File(s): diagnostics/opt-bisect.sh
Create a diagnostic script that wraps LLVM’s opt --opt-bisect-limit to binary-search which LLVM optimization pass breaks a program. This is distinct from AIMS phase bisection (diagnostics/bisect-passes.sh, which bisects the 12-step AIMS pipeline) — opt-bisect bisects LLVM’s own optimization passes (instcombine, GVN, SROA, etc.).
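The search itself does not depend on LLVM. A minimal sketch, assuming the failure is monotonic in the limit (good at the low bound, bad at the high bound, flipping exactly once in between); the function name find_first_bad and the predicate convention are hypothetical:

```shell
#!/usr/bin/env bash
# find_first_bad GOOD BAD PREDICATE
# PREDICATE is a command name called as "PREDICATE N"; it must succeed when
# limit N yields correct output and fail when it does not.
# Prints the smallest N in (GOOD, BAD] for which PREDICATE fails.
find_first_bad() {
    local good="$1" bad="$2" is_good="$3" mid
    while [ $(( bad - good )) -gt 1 ]; do
        mid=$(( (good + bad) / 2 ))
        if "$is_good" "$mid"; then
            good="$mid"   # passes 1..mid are fine; the culprit is later
        else
            bad="$mid"    # the culprit is at or before mid
        fi
    done
    echo "$bad"
}
```

In opt-bisect.sh the predicate would build with the given limit, run the binary, and compare against the expected output; the number of builds grows only logarithmically with the number of passes.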
- Create diagnostics/opt-bisect.sh following diagnostic conventions (--help, --no-color, --verbose, --json, exit codes 0/1/2):

    # Usage: diagnostics/opt-bisect.sh <file.ori> [OPTIONS]
    #
    # Binary-searches which LLVM optimization pass breaks the program.
    # The "broken" condition is: the optimized binary produces different
    # output than the unoptimized binary, OR the optimized binary crashes.
    #
    # Options:
    #   --expected OUTPUT   Expected stdout (default: captured from -O0 build)
    #   --check-leaks       Also check ORI_CHECK_LEAKS divergence
    #   --verbose           Show each bisection step
    #   --json              Machine-readable output

- Implement the bisection algorithm:
  - Build with -O0 (no optimizations) and capture the expected output
  - Build with full optimization and verify the bug reproduces (different output or crash)
  - Binary search using opt --opt-bisect-limit=N:
    - Set the LLVM_OPT_BISECT_LIMIT=N environment variable, which Ori must translate into LLVM’s -opt-bisect-limit=N option
    - Build and run with limit N
    - Compare output to expected
    - Narrow the range until the specific pass is identified
  - Report: “Pass N ({pass_name}) at function {func_name} introduces the miscompile”
- Handle the Ori-specific integration: LLVM exposes bisection as the -opt-bisect-limit=N command-line option and does not read an LLVM_OPT_BISECT_LIMIT environment variable on its own. Ori’s run_optimization_passes must therefore read the variable and pass the limit through to LLVM (one possible route is LLVMParseCommandLineOptions from the LLVM C API). Verify that the existing LLVM C API / Inkwell integration honors the limit when it is set this way.
- Add the script to diagnostics/self-test.sh with a positive test (a program that compiles correctly at all optimization levels; the script should report “no miscompile found”).
- Subsection close-out (11.5), MANDATORY before starting 11.R:
  - All tasks above are [x] and the subsection’s behavior is verified
  - Update this subsection’s status in section frontmatter to complete
  - Run /improve-tooling retrospectively on THIS subsection.
  - Run /sync-claude on THIS subsection: check whether code changes invalidated any CLAUDE.md, .claude/rules/*.md, or canon.md claims. If no API/command/phase changes, document briefly. Fix any drift NOW.
  - Repo hygiene check: run diagnostics/repo-hygiene.sh --check and clean any detected temp files.
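For the report step in 11.5, the pass name and target can be recovered from the diagnostic lines LLVM prints while bisecting. A minimal sketch, assuming lines of the form "BISECT: running pass (N) <pass> on <target>" on stderr; the exact pass and target spelling varies between LLVM versions, and the function name last_bisect_pass is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical filter: reads a captured build log on stdin and prints
# "N|pass|target" for the last pass that LLVM reported as actually run.
# Lines for skipped passes ("BISECT: NOT running pass ...") are ignored.
last_bisect_pass() {
    grep -E '^BISECT: running pass \([0-9]+\) ' \
        | tail -n 1 \
        | sed -E 's/^BISECT: running pass \(([0-9]+)\) (.*) on (.*)$/\1|\2|\3/'
}
```

When the build is run with the limit set to the first bad N, the last pass reported as run is the one the report should name.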
11.R Third Party Review Findings
- None.
11.N Completion Checklist
- Every-commit CI runs: ORI_VERIFY_EACH=1, ORI_VERIFY_ARC=1, LLVM backend spec tests, FileCheck tests
- Every-commit CI total time within 30-minute budget
- Nightly CI runs: sanitizers, Alive2 curated corpus, AIMS snapshots, ARC IR parity
- Nightly failure creates GitHub issue automatically
- Weekly CI runs: differential fuzzing, Alive2 full sweep, sanitizer matrix
- Weekly results uploaded as CI artifacts with 90-day retention
- debug-release-compare.sh --arc-ir produces normalized ARC IR diffs
- ARC IR variable/block ID normalization prevents false positives
- diagnostics/opt-bisect.sh identifies failing LLVM optimization pass
- opt-bisect added to diagnostics/self-test.sh
- LLVM crash escape hatch status documented (blocked by plans/llvm-worker-isolation/)
- All §01-§10 CI gates verified present and functional
- No existing tests regressed: timeout 150 ./test-all.sh green
- timeout 150 ./clippy-all.sh green
- Plan annotation cleanup: bash .claude/skills/impl-hygiene-review/plan-annotations.sh --plan 11 returns 0 annotations
- All intermediate TPR checkpoint findings resolved
- Plan sync: update plan metadata to reflect this section’s completion:
  - This section’s frontmatter status → complete, subsection statuses updated
  - 00-overview.md Quick Reference table status updated for this section
  - 00-overview.md mission success criteria checkboxes updated
  - index.md section status updated
- /tpr-review passed (final, full-section)
- /impl-hygiene-review passed, AFTER /tpr-review is clean
- /improve-tooling section-close sweep: verify per-subsection retrospectives ran, add cross-cutting items.
Exit Criteria: Three CI tiers operational and tested. Every-commit tier runs verification gates from §01 plus LLVM backend spec tests and FileCheck tests within the 30-minute budget. Nightly tier runs sanitizers, Alive2, AIMS snapshots, and ARC IR parity with automatic failure notification. Weekly tier runs differential fuzzing, full Alive2 sweep, and sanitizer matrix with artifact upload. debug-release-compare.sh --arc-ir catches ARC IR structural divergences with normalized diffing. diagnostics/opt-bisect.sh identifies failing LLVM passes via binary search. No verification tool from §01-§10 exists only locally without CI enforcement.