
Section 11: CI Integration & ARC IR Parity

Status: Not Started

Goal: Consolidate all verification tools from Sections 01-10 into a coherent CI pipeline with three tiers: every-commit (fast, high-signal), nightly (medium, comprehensive), and weekly (slow, exhaustive). Add ARC IR debug-vs-release parity checking (extending the existing debug-release-compare.sh) and an opt-bisect diagnostic script. This section does NOT create CI gates from scratch — per Codex feedback, each earlier section should already add its own CI gate. Section 11 verifies completeness, fills gaps, and ensures the tiered execution model is coherent.

CRITICAL BLOCKER: The LLVM crash escape hatch in test-all.sh (which masks LLVM backend crashes) is owned by plans/llvm-worker-isolation/. This section does NOT remove that escape hatch. If the escape hatch is still present when Section 11 starts, document it as a known limitation and add a <!-- blocked-by: llvm-worker-isolation --> annotation.

Success Criteria:

  • Three CI tiers operational — satisfies mission criterion: “CI fully integrated”
  • ARC IR parity catches structural divergences — satisfies mission criterion: “ARC IR debug-vs-release parity”
  • opt-bisect script identifies failing LLVM passes — satisfies mission criterion: “diagnostic tooling”
  • No verification tool exists only locally — satisfies mission criterion: “CI runs all verification”

Context: The current CI (.github/workflows/ci.yml) runs three test suites: Rust workspace tests (cargo test --workspace), Ori spec tests (interpreter only, via cargo run -p oric -- test tests/), and runtime tests (cargo test -p ori_rt). It is MISSING: LLVM backend spec tests (ori test --backend=llvm tests/), FileCheck IR tests, sanitizer instrumentation, AIMS snapshot verification, Alive2 refinement checking, and differential fuzzing. Each of Sections 01-10 adds its own CI gate incrementally; this section verifies that all gates are present and organizes them into the tiered execution model described in the research document.

Reference implementations:

  • Rust .github/workflows/: Separate workflows for PR CI (fast), scheduled CI (nightly with extra sanitizer jobs), and weekly jobs (fuzzing, extensive testing).
  • Swift utils/build-script: Tiered build modes with --validation-test (nightly) and --stress-test (weekly).
  • LLVM .github/workflows/: Separate sanitizer, coverage, and fuzz jobs on different schedules.

Depends on: All previous sections (01-10). This is the integration section.


11.1 Every-Commit CI Tier

File(s): .github/workflows/ci.yml

The every-commit tier runs on every PR. It must be fast (add at most 3 minutes to current CI) and high-signal (catch the most common regressions). These gates should already exist from Sections 01-08 — this subsection audits and fills gaps.

  • Audit existing CI for gates that should already be present from earlier sections:

    • §01: ORI_VERIFY_EACH=1 and ORI_VERIFY_ARC=1 in env block
    • §01: Function-level fn_val.verify() (implicit — runs during cargo test --workspace)
    • §07: FileCheck tests in compiler/ori_llvm/tests/codegen/ (via cargo test --workspace if integrated as Rust tests, or explicit ori test --backend=llvm compiler/ori_llvm/tests/codegen/)
    • §08: Sanitizer smoke (if §08 added a smoke job — check)
    • MISSING (known): ori test --backend=llvm tests/ — LLVM backend spec tests
  • Ensure the --emit build path runs CaptureHooks for IR capture (deferred from [TPR-09-026-codex]; currently --emit bypasses Alive2 capture, which means developer-facing builds can’t be verified)

  • Add LLVM backend spec tests to the every-commit CI if not already present:

    - name: Ori LLVM backend tests
      run: |
        set -o pipefail
        cargo run -p oric --bin ori -- test --backend=llvm tests/ 2>&1 | tee llvm-test-output.txt
        LLVM_TESTS=$(grep -oE '[0-9]+ passed' llvm-test-output.txt | grep -oE '^[0-9]+' | tail -1 || echo "0")
        LLVM_FAILED=$(grep -oE '[0-9]+ failed' llvm-test-output.txt | grep -oE '^[0-9]+' | tail -1 || echo "0")
        echo "LLVM_TESTS=$LLVM_TESTS" >> $GITHUB_ENV
        echo "LLVM_FAILED=$LLVM_FAILED" >> $GITHUB_ENV
      env:
        LLVM_SYS_211_PREFIX: /usr/lib/llvm-21
        ORI_VERIFY_EACH: "1"
        ORI_VERIFY_ARC: "1"
  • Verify the total CI time stays within the performance budget. Current test job timeout is 30 minutes. The every-commit additions should add at most 3 minutes:

    • ORI_VERIFY_EACH=1 adds ~30-60% to LLVM test time (from research)
    • FileCheck tests: fast (simple pattern matching)
    • LLVM backend spec tests: medium (AOT compile + execute per test)
    • If the total exceeds 30 minutes, parallelize by splitting into multiple jobs
  • Add the verification env vars to ALL test steps in the test job, not just the LLVM-specific one:

    env:
      ORI_VERIFY_EACH: "1"
      ORI_VERIFY_ARC: "1"
      LLVM_SYS_211_PREFIX: /usr/lib/llvm-21
  • Update the test results summary to include LLVM backend test counts in test-results.json.
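The summary update could be sketched with jq. The .suites.llvm_backend key is an assumption (mirror whatever schema the existing summary step already writes); the guard line exists only to make the sketch self-contained.

```shell
# Hypothetical sketch: fold the LLVM_TESTS / LLVM_FAILED counts captured by
# the test step into test-results.json. Defaults of 0 keep the step safe
# when the env vars are unset.
[ -f test-results.json ] || echo '{}' > test-results.json   # illustration only
jq --argjson passed "${LLVM_TESTS:-0}" --argjson failed "${LLVM_FAILED:-0}" \
   '.suites.llvm_backend = {passed: $passed, failed: $failed}' \
   test-results.json > test-results.tmp && mv test-results.tmp test-results.json
```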

  • Subsection close-out (11.1) — MANDATORY before starting 11.2:

    • All tasks above are [x] and the subsection’s behavior is verified
    • Update this subsection’s status in section frontmatter to complete
    • Run /improve-tooling retrospectively on THIS subsection — reflect on the debugging journey for 11.1 specifically: CI configuration debugging, workflow syntax issues, timing analysis. Implement every accepted improvement NOW and commit each via SEPARATE /commit-push using a valid conventional-commit type (ci: ...).
    • Run /sync-claude on THIS subsection — check whether code changes invalidated any CLAUDE.md, .claude/rules/*.md, or canon.md claims. If no API/command/phase changes, document briefly. Fix any drift NOW.
    • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files.

11.2 Nightly CI Tier

File(s): .github/workflows/nightly-verification.yml (new workflow, or extend existing nightly.yml)

The nightly tier runs on a schedule (e.g., 2:00 AM UTC). It runs more expensive verification that is too slow for every-commit CI but should catch issues before they accumulate.

  • Create .github/workflows/nightly-verification.yml (separate from the existing nightly.yml, which handles release PRs):

    name: Nightly Verification
    
    on:
      schedule:
        - cron: '0 2 * * *'  # 2:00 AM UTC daily
      workflow_dispatch:       # Manual trigger
    
    jobs:
      sanitizers:
        name: Sanitizer Suite
        runs-on: ubuntu-latest
        timeout-minutes: 45
        steps:
          # ... LLVM + Rust setup
          - name: ASan/UBSan smoke
            run: # ... from §08's CI gate
            env:
              ORI_SANITIZE: "address,undefined"
    
      alive2:
        name: Alive2 Curated Corpus
        runs-on: ubuntu-latest
        timeout-minutes: 30
        steps:
          # ... from §09.5's nightly job
    
      aims-snapshots:
        name: AIMS Snapshot Verification
        runs-on: ubuntu-latest
        timeout-minutes: 15
        steps:
          # ... build + cargo test -p oric --test aims_snapshots
    
      arc-parity:
        name: ARC IR Parity (Debug vs Release)
        runs-on: ubuntu-latest
        timeout-minutes: 20
        steps:
          # ... from §11.4 below
  • Harden Alive2 suppression workflow — deferred from Section 09 TPR:

    • Suppressions must run alive-tv with suppression-aware checking so stale suppressions are detected when the underlying false positive is resolved ([TPR-09-022-codex])
    • Normalize artifact paths to repo-relative in both ir_capture.rs and alive2-verify.sh to eliminate absolute vs relative path inconsistency ([TPR-09-023-codex])
    • Add suppression-stale status to tests/alive2/results-schema.json so the schema can represent stale suppressions ([TPR-09-024-codex], depends on TPR-09-022)
    • scripts/build-alive2.sh --cached must verify the cached binary matches the pinned commit hash, not just that a cached binary exists ([TPR-09-025-codex])
    • Z3 preflight detection in scripts/build-alive2.sh must check for dev headers/libs, not just CLI presence — cmake FindZ3 handles build-time discovery but preflight should warn early ([TPR-09-027-codex])
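The stale-suppression check ([TPR-09-022-codex]) could be structured so the loop is testable without Alive2 installed: inject the verifier command, and in alive2-verify.sh pass a wrapper that runs alive-tv on the pair and succeeds only when the transformation now verifies cleanly. The one-pair-per-line suppressions format and all names here are assumptions, not the final design.

```shell
# find_stale_suppressions CHECKER FILE
#   CHECKER src tgt  -- exits 0 if the pair now verifies (suppression is stale)
#   Prints one line per stale suppression found in FILE.
find_stale_suppressions() {
  local checker="$1" file="$2" src tgt
  while read -r src tgt; do
    [ -z "$src" ] && continue          # skip blank lines
    if "$checker" "$src" "$tgt"; then
      echo "stale suppression: $src -> $tgt"
    fi
  done < "$file"
}
```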
  • Audit which nightly gates should already exist from earlier sections:

    • §03: AIMS snapshot tests (cargo test -p oric --test aims_snapshots)
    • §08: Full sanitizer suite (ASan/UBSan on AOT smoke subset)
    • §09: Alive2 curated corpus (diagnostics/alive2-verify.sh --corpus)
  • Add failure notification. Nightly failures must notify (unlike weekly, which is informational):

    notify:
      name: Notify on Failure
      needs: [sanitizers, alive2, aims-snapshots, arc-parity]
      if: failure()
      runs-on: ubuntu-latest
      steps:
        - name: Create issue
          uses: actions/github-script@v7
          with:
            script: |
              github.rest.issues.create({
                owner: context.repo.owner,
                repo: context.repo.repo,
                title: `Nightly verification failed: ${new Date().toISOString().slice(0,10)}`,
                body: `Nightly verification pipeline failed. Check workflow run.`,
                labels: ['nightly-failure', 'verification']
              })
  • TPR checkpoint/tpr-review covering 11.1–11.2 implementation work

  • Subsection close-out (11.2) — MANDATORY before starting 11.3:

    • All tasks above are [x] and the subsection’s behavior is verified
    • Update this subsection’s status in section frontmatter to complete
    • Run /improve-tooling retrospectively on THIS subsection — same protocol as 11.1’s close-out, scoped to 11.2’s debugging journey. Commit improvements separately using a valid conventional-commit type.
    • Run /sync-claude on THIS subsection — check whether code changes invalidated any CLAUDE.md, .claude/rules/*.md, or canon.md claims. If no API/command/phase changes, document briefly. Fix any drift NOW.
    • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files.

11.3 Weekly CI Tier

File(s): .github/workflows/weekly-verification.yml (new workflow)

The weekly tier runs expensive, exhaustive verification: full fuzzing campaigns, complete Alive2 sweeps, and full sanitizer matrices. Results are informational — failures create tracking issues but do not block development.

  • Create .github/workflows/weekly-verification.yml:

    name: Weekly Verification
    
    on:
      schedule:
        - cron: '0 4 * * 0'  # 4:00 AM UTC every Sunday
      workflow_dispatch:
    
    jobs:
      fuzz:
        name: Differential Fuzzing
        runs-on: ubuntu-latest
        timeout-minutes: 240  # 4 hours
        steps:
          # ... from §10.5's weekly job
    
      alive2-full:
        name: Alive2 Full Sweep
        runs-on: ubuntu-latest
        timeout-minutes: 120
        steps:
          # ... from §09.5's weekly job
    
      sanitizer-matrix:
        name: Full Sanitizer Matrix
        runs-on: ubuntu-latest
        timeout-minutes: 60
        strategy:
          matrix:
            sanitizer: [address, undefined]
            # MSan requires full-program instrumentation — separate job
        steps:
          # ... per-sanitizer full sweep from §08
  • Upload all results as CI artifacts for historical comparison (consumed by §12):

    - uses: actions/upload-artifact@v4
      with:
        name: weekly-verification-${{ github.run_id }}
        path: |
          build/alive2-results/
          fuzz/artifacts/
          build/sanitizer-results/
        retention-days: 90
  • Weekly failures create tracking issues but do NOT block merges:

    notify:
      name: Create Tracking Issue
      needs: [fuzz, alive2-full, sanitizer-matrix]
      if: failure()
      runs-on: ubuntu-latest
      steps:
        - name: Create tracking issue
          uses: actions/github-script@v7
          with:
            script: |
              github.rest.issues.create({
                owner: context.repo.owner,
                repo: context.repo.repo,
                title: `Weekly verification: findings on ${new Date().toISOString().slice(0,10)}`,
                body: `Weekly verification found issues. Triage required.`,
                labels: ['weekly-verification', 'triage-needed']
              })
  • Subsection close-out (11.3) — MANDATORY before starting 11.4:

    • All tasks above are [x] and the subsection’s behavior is verified
    • Update this subsection’s status in section frontmatter to complete
    • Run /improve-tooling retrospectively on THIS subsection.
    • Run /sync-claude on THIS subsection — check whether code changes invalidated any CLAUDE.md, .claude/rules/*.md, or canon.md claims. If no API/command/phase changes, document briefly. Fix any drift NOW.
    • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files.

11.4 ARC IR Debug-vs-Release Parity

File(s): diagnostics/debug-release-compare.sh

Extend the existing debug-release-compare.sh to capture and compare ARC IR between debug and release builds. Currently the script compares behavioral output (exit codes + stdout) and LLVM IR diffs — but not ARC IR. AIMS pipeline divergences between debug and release (due to different optimization flags or analysis precision) can cause subtle behavioral drift masked by LLVM optimization.

  • Add --arc-ir flag to debug-release-compare.sh that enables ARC IR comparison:

    # In the debug build step:
    ORI_DUMP_AFTER_ARC=1 cargo run -- build "$file" -o "$debug_binary" 2>"$debug_arc_ir"
    
    # In the release build step:
    ORI_DUMP_AFTER_ARC=1 cargo run --release -- build "$file" -o "$release_binary" 2>"$release_arc_ir"
    
    # Compare ARC IR (structural diff, ignoring whitespace and variable IDs)
    diff_arc_ir "$debug_arc_ir" "$release_arc_ir"
  • Implement ARC IR diff normalization. ARC IR uses generated variable IDs (v0, v1, …) that may differ between debug and release builds due to different allocation patterns. The diff must normalize:

    • Variable IDs: v<N> → canonical renumbering based on first occurrence
    • Block IDs: bb<N> → canonical renumbering based on first occurrence
    • Whitespace: normalize to single spaces
    • Keep all instructions, RC operations, and control flow intact
  • Add ARC IR parity to the nightly CI (§11.2) with a representative subset of test programs:

    # Run on key programs that exercise different ARC patterns:
    for f in tests/spec/traits/iterator/*.ori tests/spec/collections/cow/*.ori; do
        diagnostics/debug-release-compare.sh --arc-ir "$f" || PARITY_FAILURES=$((PARITY_FAILURES + 1))
    done
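The normalization rules above could be sketched as a small awk/sed pipeline. The v<N>/bb<N> token shapes are assumptions about the ARC IR text format, and the real diff_arc_ir would live inside debug-release-compare.sh; this is only a sketch of the canonical-renumbering idea.

```shell
# normalize_arc_ir FILE: renumber v<N>/bb<N> by order of first occurrence
# and collapse whitespace, so allocation-order differences do not diff.
normalize_arc_ir() {
  awk '
    {
      line = $0; out = ""
      # Renumber v<N> and bb<N> tokens by order of first occurrence.
      while (match(line, /(v|bb)[0-9]+/)) {
        tok = substr(line, RSTART, RLENGTH)
        if (!(tok in seen)) {
          prefix = (tok ~ /^bb/) ? "bb" : "v"
          seen[tok] = prefix "_" counter[prefix]++
        }
        out = out substr(line, 1, RSTART - 1) seen[tok]
        line = substr(line, RSTART + RLENGTH)
      }
      print out line
    }
  ' "$1" | sed -E 's/[[:space:]]+/ /g; s/ $//; s/^ //'
}

# diff_arc_ir A B: structural diff of two ARC IR dumps after normalization.
diff_arc_ir() {
  local a b rc
  a=$(mktemp); b=$(mktemp)
  normalize_arc_ir "$1" > "$a"
  normalize_arc_ir "$2" > "$b"
  diff "$a" "$b"; rc=$?
  rm -f "$a" "$b"
  return $rc
}
```

Note the token regex is deliberately loose (it would also renumber a v<N> suffix inside a longer identifier); the production version should anchor on the ARC IR's real token grammar.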
  • TPR checkpoint/tpr-review covering 11.3–11.4 implementation work

  • Subsection close-out (11.4) — MANDATORY before starting 11.5:

    • All tasks above are [x] and the subsection’s behavior is verified
    • Update this subsection’s status in section frontmatter to complete
    • Run /improve-tooling retrospectively on THIS subsection.
    • Run /sync-claude on THIS subsection — check whether code changes invalidated any CLAUDE.md, .claude/rules/*.md, or canon.md claims. If no API/command/phase changes, document briefly. Fix any drift NOW.
    • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files.

11.5 opt-bisect Diagnostic Script

File(s): diagnostics/opt-bisect.sh

Create a diagnostic script that wraps LLVM’s opt --opt-bisect-limit to binary-search which LLVM optimization pass breaks a program. This is distinct from AIMS phase bisection (diagnostics/bisect-passes.sh, which bisects the 12-step AIMS pipeline) — opt-bisect bisects LLVM’s own optimization passes (instcombine, GVN, SROA, etc.).

  • Create diagnostics/opt-bisect.sh following diagnostic conventions (--help, --no-color, --verbose, --json, exit codes 0/1/2):

    # Usage: diagnostics/opt-bisect.sh <file.ori> [OPTIONS]
    #
    # Binary-searches which LLVM optimization pass breaks the program.
    # The "broken" condition is: the optimized binary produces different
    # output than the unoptimized binary, OR the optimized binary crashes.
    #
    # Options:
    #   --expected OUTPUT  Expected stdout (default: captured from -O0 build)
    #   --check-leaks      Also check ORI_CHECK_LEAKS divergence
    #   --verbose          Show each bisection step
    #   --json             Machine-readable output
  • Implement the bisection algorithm:

    1. Build with -O0 (no optimizations) — capture expected output
    2. Build with full optimization — verify the bug reproduces (different output or crash)
    3. Binary search on the pass limit using LLVM’s opt-bisect mechanism:
      • Set the LLVM_OPT_BISECT_LIMIT=N environment variable (Ori’s pipeline translates it into LLVM’s -opt-bisect-limit option)
      • Build and run with limit N
      • Compare output to expected
      • Narrow the range until the specific pass is identified
    4. Report: “Pass N ({pass_name}) at function {func_name} introduces the miscompile”
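The core of step 3 could be a plain binary search with the build-and-check step injected as a command, which keeps the search logic testable without a working toolchain. In opt-bisect.sh the checker would build with LLVM_OPT_BISECT_LIMIT set to the candidate limit and compare the binary's output to the -O0 baseline; all names here are illustrative.

```shell
# opt_bisect_search CHECK MAX
#   CHECK n  -- exits 0 if the program is still correct with pass limit n
#   Prints the index of the first pass whose application breaks the program.
opt_bisect_search() {
  local check="$1" lo=0 hi="$2" mid
  # Invariant: CHECK lo succeeds (limit 0 disables all skippable passes)
  # and CHECK hi fails (the full pipeline reproduces the bug).
  while [ $((hi - lo)) -gt 1 ]; do
    mid=$(( (lo + hi) / 2 ))
    if "$check" "$mid"; then
      lo=$mid   # still correct through this limit: culprit is later
    else
      hi=$mid   # already broken: culprit is at or before this limit
    fi
  done
  echo "$hi"
}
```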
  • Handle the Ori-specific integration: LLVM reads the pass limit from the -opt-bisect-limit command-line option (a cl::opt consulted by OptBisect), not from any environment variable, so Ori’s run_optimization_passes must read LLVM_OPT_BISECT_LIMIT and forward it (for example via LLVMParseCommandLineOptions) before running the pipeline. Verify that the existing LLVM C API / Inkwell integration can set this option.

  • Add the script to diagnostics/self-test.sh with a positive test (a program that compiles correctly at all optimization levels — the script should report “no miscompile found”).

  • Subsection close-out (11.5) — MANDATORY before starting 11.R:

    • All tasks above are [x] and the subsection’s behavior is verified
    • Update this subsection’s status in section frontmatter to complete
    • Run /improve-tooling retrospectively on THIS subsection.
    • Run /sync-claude on THIS subsection — check whether code changes invalidated any CLAUDE.md, .claude/rules/*.md, or canon.md claims. If no API/command/phase changes, document briefly. Fix any drift NOW.
    • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files.

11.R Third Party Review Findings

  • None.

11.N Completion Checklist

  • Every-commit CI runs: ORI_VERIFY_EACH=1, ORI_VERIFY_ARC=1, LLVM backend spec tests, FileCheck tests
  • Every-commit CI total time within 30-minute budget
  • Nightly CI runs: sanitizers, Alive2 curated corpus, AIMS snapshots, ARC IR parity
  • Nightly failure creates GitHub issue automatically
  • Weekly CI runs: differential fuzzing, Alive2 full sweep, sanitizer matrix
  • Weekly results uploaded as CI artifacts with 90-day retention
  • debug-release-compare.sh --arc-ir produces normalized ARC IR diffs
  • ARC IR variable/block ID normalization prevents false positives
  • diagnostics/opt-bisect.sh identifies failing LLVM optimization pass
  • opt-bisect added to diagnostics/self-test.sh
  • LLVM crash escape hatch status documented (blocked by plans/llvm-worker-isolation/)
  • All §01-§10 CI gates verified present and functional
  • No existing tests regressed: timeout 150 ./test-all.sh green
  • timeout 150 ./clippy-all.sh green
  • Plan annotation cleanup: bash .claude/skills/impl-hygiene-review/plan-annotations.sh --plan 11 returns 0 annotations
  • All intermediate TPR checkpoint findings resolved
  • Plan sync — update plan metadata to reflect this section’s completion:
    • This section’s frontmatter status set to complete; subsection statuses updated
    • 00-overview.md Quick Reference table status updated for this section
    • 00-overview.md mission success criteria checkboxes updated
    • index.md section status updated
  • /tpr-review passed (final, full-section)
  • /impl-hygiene-review passed — AFTER /tpr-review is clean
  • /improve-tooling section-close sweep — verify per-subsection retrospectives ran, add cross-cutting items.

Exit Criteria: Three CI tiers operational and tested. Every-commit tier runs verification gates from §01 plus LLVM backend spec tests and FileCheck tests within the 30-minute budget. Nightly tier runs sanitizers, Alive2, AIMS snapshots, and ARC IR parity with automatic failure notification. Weekly tier runs differential fuzzing, full Alive2 sweep, and sanitizer matrix with artifact upload. debug-release-compare.sh --arc-ir catches ARC IR structural divergences with normalized diffing. diagnostics/opt-bisect.sh identifies failing LLVM passes via binary search. No verification tool from §01-§10 exists only locally without CI enforcement.