Section 04: AOT Pipeline Optimization
Status: Not Started
Goal: AOT integration test execution drops from 35.6s to ≤15s. All ~1,950 tests pass identically. No test code modified.
Context: The AOT integration tests (compiler/ori_llvm/tests/aot/) account for 60% of cargo t wall time. Each of the ~1,950 tests spawns TWO subprocesses: (1) ori build (full compile pipeline: lex→parse→typeck→ARC→LLVM→link) and (2) the compiled binary. Each test also creates a TempDir, writes source to disk, and cleans up afterward. At ~18ms/test average, the per-test cost is already lean, but the cumulative effect of ~1,950 cycles (with ~3,900 process spawns) is the dominant bottleneck. Section 03’s per-phase timing will reveal which phase (compile/link/execute/overhead) to attack first.
Feasibility analysis:
- Current: 35.6s / ~1,950 tests = ~18.3 ms/test
- Target (≤15s): 15s / ~1,950 tests = ~7.7 ms/test
- Required speedup: 2.4x
- This is aggressive. The per-test overhead includes two `Command::new()` spawns, file I/O, and a full `ori build` invocation. If profiling reveals that most per-test time is in irreducible work (process spawning + LLVM compilation is inherently expensive), the target may need revision. The plan treats ≤15s as aspirational; the actual target is "whatever profiling shows is achievable with reasonable effort." If 20s is the achievable floor, that's still a 44% improvement and should be accepted.
Depends on: Section 03 (Profiling Infrastructure) — the per-phase timing data guides which optimizations have the highest ROI.
Baseline coordination: Linker changes (04.1) affect baseline measurements. Record a “pre-linker” baseline during Section 03, then record a “post-linker” measurement after Section 04.1. The final comparison in Section 06 uses the original Section 03 baseline as the “before” and Section 06’s measurement as the “after.”
04.1 Linker Optimization
File(s): compiler/ori_llvm/src/aot/linker/ (linker drivers: gcc.rs, msvc.rs, wasm/), compiler/ori_llvm/src/aot/linker/driver.rs (linker driver selection)
Linking is often the slowest phase in compile→link→execute cycles. Switching from the system linker (ld/cc) to a faster alternative can yield dramatic improvements.
- Check Section 03's per-phase timing to determine what percentage of AOT test time is spent in the linker phase. If linking is <10% of total, skip this subsection.
- Check which linker the AOT tests currently use:
  ```bash
  # The linker is selected in the AOT linker module:
  grep -r "Command::new\|cc\|gcc\|ld\|clang" compiler/ori_llvm/src/aot/linker/gcc.rs
  grep -r "linker\|link_command\|link_args" compiler/ori_llvm/src/aot/linker/driver.rs
  ```
  The `GccLinker` (in `gcc.rs`) is used on Linux/macOS and invokes `cc` by default.
- Check if `mold` (fastest linker for Linux) is available:
  ```bash
  mold --version  # NOT installed as of 2026-03-25. Install: sudo apt install mold
  ```
- Check if `lld` (LLVM's linker, faster than system `ld`) is available:
  ```bash
  /usr/lib/llvm-21/bin/ld.lld --version  # AVAILABLE via LLVM 21 at /usr/lib/llvm-21/bin/ld.lld
  ```
  Recommendation: Start with `lld` since it is already available. Only install `mold` if `lld` provides <10% improvement over system `ld`.
- Implement `ORI_LINKER` env var support. Currently `compiler/ori_llvm/src/aot/linker/driver.rs` line 30 selects the flavor via `LinkerFlavor::for_target()` with no env var override. Implementation (a sketch of the override logic appears at the end of this list):
  - WHERE: `compiler/ori_llvm/src/aot/linker/driver.rs`, in `LinkerDriver::link()` (line 30), before the `input.linker.unwrap_or_else(...)` call
  - WHAT: Read `std::env::var("ORI_LINKER")`. If set:
    - If the value is `"lld"` or the path ends in `ld.lld`: set `flavor = LinkerFlavor::Lld`
    - If the value is `"mold"`: create `GccLinker::with_path(target, "cc")` and add the `-fuse-ld=mold` arg
    - If the value is a path: create `GccLinker::with_path(target, &value)` (custom linker binary)
  - TEST: Add a test in `compiler/ori_llvm/src/aot/linker/tests.rs` that verifies the `ORI_LINKER` env var is respected
  - Usage:
    ```bash
    ORI_LINKER=lld cargo test -p ori_llvm --test aot
    # Or with explicit path:
    ORI_LINKER=/usr/lib/llvm-21/bin/ld.lld cargo test -p ori_llvm --test aot
    ```
- Measure the improvement:
  ```bash
  # Before (system linker):
  ORI_TEST_TIMING=1 cargo test -p ori_llvm --test aot 2>&1 | grep "Link:"
  # After (mold):
  ORI_LINKER=mold ORI_TEST_TIMING=1 cargo test -p ori_llvm --test aot 2>&1 | grep "Link:"
  ```
- If the linker optimization provides measurable improvement (>10%), make it the default for test builds. Add a note in CLAUDE.md about the linker configuration.
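A minimal sketch of the override logic described above. The `LinkerChoice` enum and `resolve_ori_linker` function are hypothetical stand-ins; the real change would set `LinkerFlavor::Lld` or construct `GccLinker` inside `LinkerDriver::link()`, whose actual signatures are not spelled out in this plan.

```rust
use std::env;

/// Hypothetical stand-in for the decision made in LinkerDriver::link().
/// Variant names are illustrative only, not the crate's API.
#[derive(Debug, PartialEq)]
enum LinkerChoice {
    Default,                     // unset: keep LinkerFlavor::for_target() behavior
    Lld,                         // force LLVM's lld
    GccWithFuseLd(&'static str), // keep the cc driver, pass -fuse-ld=<name>
    CustomPath(String),          // explicit linker binary path
}

fn resolve_ori_linker() -> LinkerChoice {
    match env::var("ORI_LINKER") {
        Err(_) => LinkerChoice::Default,
        Ok(v) if v == "lld" || v.ends_with("ld.lld") => LinkerChoice::Lld,
        Ok(v) if v == "mold" => LinkerChoice::GccWithFuseLd("mold"),
        Ok(path) => LinkerChoice::CustomPath(path),
    }
}

fn main() {
    println!("ORI_LINKER resolves to {:?}", resolve_ori_linker());
}
```

Keeping the unset case as `Default` preserves the existing `LinkerFlavor::for_target()` behavior, which is exactly what the Test Strategy matrix pins.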
Test Strategy
- TDD ordering:
  - Write a Rust unit test in `compiler/ori_llvm/src/aot/linker/tests.rs` that verifies the `ORI_LINKER` env var selects the correct linker flavor BEFORE implementing the feature
  - Verify the test fails (since `ORI_LINKER` support does not exist yet)
  - Implement `ORI_LINKER` support
  - Verify the test passes unchanged
- Matrix: The linker change is not type-dependent but path-dependent. Test matrix (see the unit-test sketch below):
  - `ORI_LINKER` unset: default linker selected (existing behavior preserved)
  - `ORI_LINKER=lld`: `lld` linker selected
  - `ORI_LINKER=/usr/lib/llvm-21/bin/ld.lld`: explicit path accepted
  - `ORI_LINKER=invalid`: graceful error (not a panic)
  - All ~1,950 AOT tests pass identically with each valid linker. No behavioral changes.
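A sketch of the failing-first unit test, shown against a hypothetical `flavor_for` helper rather than the real driver API (which this plan does not spell out); the real test in `compiler/ori_llvm/src/aot/linker/tests.rs` would exercise whatever `LinkerDriver` actually exposes.

```rust
// Hypothetical: Flavor and flavor_for stand in for whatever the driver
// exposes; only the ORI_LINKER contract from the matrix above is pinned here.
#[derive(Debug, PartialEq)]
enum Flavor {
    Default, // existing LinkerFlavor::for_target() behavior
    Lld,
}

/// Decide the flavor from an ORI_LINKER value (None = env var unset).
fn flavor_for(ori_linker: Option<&str>) -> Flavor {
    match ori_linker {
        Some(v) if v == "lld" || v.ends_with("ld.lld") => Flavor::Lld,
        _ => Flavor::Default,
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn ori_linker_selects_lld_flavor() {
        assert_eq!(flavor_for(None), Flavor::Default);                             // unset
        assert_eq!(flavor_for(Some("lld")), Flavor::Lld);                          // name form
        assert_eq!(flavor_for(Some("/usr/lib/llvm-21/bin/ld.lld")), Flavor::Lld);  // path form
    }
}
```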
- Semantic pin: The `ORI_LINKER=lld` unit test ONLY passes with the new env var support — reverting the change makes it fail.
- Debug and release: `timeout 150 cargo test -p ori_llvm --test aot` passes in both debug and release builds.
- Measurement: Compare link-phase timing before and after. Record it in this section.
- `/tpr-review` passed — independent review found no critical or major issues (or all findings triaged)
- `/impl-hygiene-review` passed — hygiene review clean. MUST run AFTER `/tpr-review` is clean.
- Subsection close-out (04.1) — MANDATORY before starting the next subsection. Run `/improve-tooling` retrospectively on THIS subsection's debugging journey (per `.claude/skills/improve-tooling/SKILL.md` "Per-Subsection Workflow"): which `diagnostics/` scripts you ran, where you added `dbg!`/`tracing` calls, where output was hard to interpret, where test failures gave unhelpful messages, where you ran the same command sequence repeatedly. Forward-look: what tool/log/diagnostic would shorten the next regression in this code path by 10 minutes? Implement improvements NOW (zero deferral) and commit each via a SEPARATE `/commit-push` using a valid conventional-commit type (`build(diagnostics): ... — surfaced by section-04.1 retrospective` — `build`/`test`/`chore`/`ci`/`docs` are valid; `tools(...)` is rejected by the lefthook commit-msg hook). Mandatory even when nothing felt painful. If genuinely no gaps, document briefly: "Retrospective 04.1: no tooling gaps". Update this subsection's `status` in the section frontmatter to `complete`.
- `/sync-claude` section-close doc sync — verify Claude artifacts across all section commits. Map changed crates to rules files, check CLAUDE.md, canon.md. Fix drift NOW.
- Repo hygiene check — run `diagnostics/repo-hygiene.sh --check` and clean any detected temp files.
04.2 Compilation Pipeline Optimization
File(s): compiler/oric/src/commands/build/ (build command — mod.rs, single.rs, multi.rs), compiler/oric/src/commands/codegen_pipeline.rs (compilation pipeline), compiler/ori_llvm/src/aot/ (AOT pipeline), compiler/ori_llvm/tests/aot/util/aot.rs (test harness)
Each AOT test spawns ori build as a separate subprocess via Command::new(ori_binary()). The full compilation pipeline (lex→parse→typeck→ARC→LLVM→object→link) runs inside that subprocess. This means:
- Each test starts a fresh process with a fresh LLVM Context, fresh Salsa DB, etc.
- There is no cross-test caching or context reuse
- Optimizations must target either (a) the `ori build` pipeline itself, (b) the subprocess overhead, or (c) restructuring to avoid per-test subprocesses
- Check Section 03's per-phase timing to determine which compilation phase dominates.
- Shared runtime pre-compilation: The `ori_rt` runtime library (`libori_rt.a`) is linked into every AOT binary. It is a pre-built static library discovered by `ori_llvm/src/aot/runtime.rs` (checked at `<exe>/../lib/libori_rt.a` or `$ORI_WORKSPACE_DIR/target/`). Verify it is NOT being rebuilt per test — the linker just reads it. If the linker re-reads `libori_rt.a` from disk for each of ~1,950 tests, the I/O overhead adds up. Check whether the OS page cache handles this effectively.
- LLVM optimization level for tests: Verify `ori build` defaults to `-O0`. Checked: `compiler/oric/src/commands/build_options/mod.rs` line 88 sets `opt_level: OptLevel::O0` as the default. The `--release` flag applies `O2`. AOT tests call `ori build` without `--release`, so they already use `-O0`. This optimization is already in place — skip unless profiling reveals LLVM optimization passes are still a bottleneck.
- LLVM Context creation overhead: Each `ori build` invocation creates a fresh LLVM Context. This happens ~1,950 times. Context creation involves LLVM target initialization. This is NOT optimizable within the current subprocess architecture — it would require an in-process compilation mode (see the batch test execution work in 04.3).
- Object file writing + linking: The current pipeline writes object files to disk (in the TempDir), then invokes the system linker. Both are per-test I/O operations:
  ```bash
  grep -r "write_to_file\|write_bitcode\|object_file\|emit_object" compiler/ori_llvm/src/aot/
  ```
  The TempDir is on `/tmp`, which is ext4 on this WSL2 system (NOT tmpfs). This means object file writes hit the real filesystem. Consider mounting a tmpfs at a custom location and setting `TMPDIR` for AOT tests to reduce I/O overhead:
  ```bash
  # Option: mount tmpfs for test builds
  sudo mount -t tmpfs -o size=512m tmpfs /tmp/ori-test-builds
  TMPDIR=/tmp/ori-test-builds cargo test -p ori_llvm --test aot
  ```
- Salsa query caching: NOT applicable for AOT tests — each test spawns a separate `ori build` process with a fresh Salsa DB. There is no cross-test Salsa caching. This is a fundamental limitation of the subprocess architecture. The only way to get Salsa caching benefits would be an in-process compilation mode or a persistent compiler server.
- Measure the impact of each optimization individually (not combined) to understand the contribution of each.
Test Strategy
- TDD ordering: These are optimization changes, not bug fixes. The "test" is that all existing tests continue passing with identical results. Before each optimization (a snapshot-comparison sketch follows this item):
  - Record the full AOT test output (pass/fail/stdout per test) as a before-snapshot
  - Apply the optimization
  - Verify the after-output is identical to the before-snapshot
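A minimal sketch of the before/after pin, assuming each snapshot is a plain text file with one normalized line per test; the file names and line format here are illustrative, not an existing artifact.

```rust
use std::fs;

/// Hypothetical before/after check: each snapshot holds one line per test
/// (name + pass/fail + a digest of captured stdout). Any difference is a
/// behavioral change, not just a performance change.
fn snapshots_identical(before_path: &str, after_path: &str) -> bool {
    let normalize = |path: &str| -> Vec<String> {
        let mut lines: Vec<String> = fs::read_to_string(path)
            .expect("snapshot file missing")
            .lines()
            .map(|l| l.to_string())
            .collect();
        lines.sort(); // parallel test order is nondeterministic
        lines
    };
    normalize(before_path) == normalize(after_path)
}

fn main() {
    println!("identical: {}", snapshots_identical("before.txt", "after.txt"));
}
```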
- Matrix: All ~1,950 AOT tests pass identically after each optimization. No test code modified.
- Semantic pin: The before/after output comparison IS the semantic pin — any output difference indicates a behavioral change, not just a performance change.
- Debug and release: `timeout 150 cargo t` (debug) passes after each change. The release build (`timeout 150 cargo t --release`) is verified at the end of the subsection.
- Measurement: Record per-phase timing before and after each optimization.
- `/tpr-review` passed — independent review found no critical or major issues (or all findings triaged)
- `/impl-hygiene-review` passed — hygiene review clean. MUST run AFTER `/tpr-review` is clean.
- Subsection close-out (04.2) — MANDATORY before starting the next subsection. Run `/improve-tooling` retrospectively on THIS subsection's debugging journey (per `.claude/skills/improve-tooling/SKILL.md` "Per-Subsection Workflow"): which `diagnostics/` scripts you ran, where you added `dbg!`/`tracing` calls, where output was hard to interpret, where test failures gave unhelpful messages, where you ran the same command sequence repeatedly. Forward-look: what tool/log/diagnostic would shorten the next regression in this code path by 10 minutes? Implement improvements NOW (zero deferral) and commit each via a SEPARATE `/commit-push` using a valid conventional-commit type (`build(diagnostics): ... — surfaced by section-04.2 retrospective` — `build`/`test`/`chore`/`ci`/`docs` are valid; `tools(...)` is rejected by the lefthook commit-msg hook). Mandatory even when nothing felt painful. If genuinely no gaps, document briefly: "Retrospective 04.2: no tooling gaps". Update this subsection's `status` in the section frontmatter to `complete`.
- `/sync-claude` section-close doc sync — verify Claude artifacts across all section commits. Map changed crates to rules files, check CLAUDE.md, canon.md. Fix drift NOW.
- Repo hygiene check — run `diagnostics/repo-hygiene.sh --check` and clean any detected temp files.
04.3 Process Overhead Reduction
File(s): compiler/ori_llvm/tests/aot/
Each AOT test spawns TWO child processes: (1) `ori build` for compilation and (2) the compiled binary for execution. Process creation (fork+exec) and teardown have overhead that multiplies across ~1,950 tests (~3,900 total process spawns).
- Check Section 03's per-phase timing for the "execute" and "overhead" components.
- Temp file management: Each test creates a new `TempDir` (via the `tempfile` crate) with a unique source file and binary:
  ```rust
  // From compiler/ori_llvm/tests/aot/util/aot.rs: compile_and_run_capture() (line 149)
  let temp_dir = TempDir::new().expect("Failed to create temp dir");
  let source_path = temp_dir.path().join(format!("test_{id}.ori"));
  let binary_path = temp_dir.path().join(format!("test_{id}{}", std::env::consts::EXE_SUFFIX));
  ```
  The filesystem overhead (mkdir + write source + write object + write binary + unlink all) adds up over ~1,950 tests. Consider (see the sketch after this item):
  - Reusing a single temp directory across tests (with unique filenames via the `AtomicU64` counter already in place)
  - Verifying `/tmp` is tmpfs on the target system (reduces disk I/O)
- Binary execution overhead: Each test runs a compiled binary with `ORI_CHECK_LEAKS=1` always enabled (hardcoded in `compile_and_run_capture()` at line 178, and also in `compile_and_run_with_args()` at line 221):
  ```rust
  let run_result = Command::new(&binary_path)
      .env("ORI_CHECK_LEAKS", "1")
      .output()
  ```
  Leak detection adds per-allocation tracking overhead to the runtime. Consider (see the sketch after this item):
  - Making leak detection opt-in for regular test runs via an env var (e.g., only enable it when `ORI_AOT_CHECK_LEAKS=1` is set)
  - Keeping it always-on for CI
  - Measuring the overhead: run the same tests with and without `ORI_CHECK_LEAKS=1` and compare
- Parallel AOT tests: The AOT tests already run in parallel by default. They are standard `#[test]` functions with no `#[serial]` annotation and no shared mutable state (each test creates its own `TempDir` and spawns independent subprocesses). Rust's test framework runs them in parallel threads. However:
  - The degree of parallelism is controlled by `--test-threads=N` (default: number of CPU cores)
  - Each test spawns 2 subprocesses, so with N parallel tests there are up to 2N concurrent processes
  - Investigate whether the current parallelism level is optimal or if system I/O saturation limits gains
  - There is NO LLVM Context thread-safety concern — the LLVM Context lives inside the `ori build` subprocess, not the test process
- Batch test execution: Instead of one `ori build` invocation per test, group multiple small tests into a single compilation unit. Concrete approach (a prototype sketch follows this item):
  - Prototype with 10 tests: Pick 10 independent AOT tests. Concatenate their Ori source into a single `.ori` file with distinct `@main`-like functions selected by a command-line argument. Measure compile+link+run time for the batch vs 10 individual runs.
  - Measure overhead split: From the prototype, determine what fraction of per-test cost is fixed overhead (process spawn, LLVM Context init, `ori_rt` linkage, temp dir creation) vs variable (source-proportional compilation). If fixed overhead is >60% of per-test time, batching will have significant ROI.
  - Implement batching if profitable: If the prototype shows a >2x speedup for the batch, implement a `BatchedAotRunner` in `compiler/ori_llvm/tests/aot/util/` that:
    - Groups tests by expected-success vs expected-failure (failures cannot share a compilation unit)
    - Generates a single `.ori` source with all test bodies as separate functions
    - Compiles once, then runs the binary once per test function (or with a dispatch argument)
    - Falls back to individual execution for any test that fails compilation in batch mode
  - Failure isolation: If a batch compilation fails, re-run each test individually to identify the failing test. This preserves test granularity for error reporting.
  - Fallback alternative: If batching is insufficient (target still not met after steps 1-4), implement a persistent `ori build --server` mode that accepts multiple compilation requests without process restart. This amortizes LLVM Context creation, Salsa DB initialization, and `ori_rt` linkage across many tests. Only pursue this if batching alone is insufficient and profiling confirms LLVM Context creation is a major cost.
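A sketch of the compile-once / run-per-test shape the 10-test prototype could take. Everything here is hypothetical: the batch source and the dispatch-by-argument binary do not exist yet, and the `-o` output flag is an assumed spelling of whatever `ori build` actually accepts.

```rust
use std::process::Command;

// Hypothetical prototype driver for the batch experiment described above.
fn run_batch(ori_binary: &str, batch_source: &str, batch_bin: &str, tests: &[&str]) {
    // One compile for the whole batch: amortizes process spawn, LLVM Context
    // init, ori_rt linkage, and temp dir creation across all grouped tests.
    let build = Command::new(ori_binary)
        .args(["build", batch_source, "-o", batch_bin]) // "-o" flag is an assumption
        .status()
        .expect("failed to spawn ori build");
    assert!(build.success(), "batch compile failed: fall back to individual runs");

    // One execution per test, selected by a dispatch argument.
    for name in tests {
        let out = Command::new(batch_bin)
            .arg(name)
            .output()
            .expect("failed to run batch binary");
        println!("{name}: exit={:?}", out.status.code());
    }
}
```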
- Hygiene: split `compiler/ori_llvm/tests/aot/util/aot.rs` (569 lines, exceeds the 500-line limit). Extract the IR inspection helpers (`extract_function_ir`, `count_bridge_blocks`, `count_single_pred_phis`, `count_dead_phis`, `is_ssa_var_used_in`, `is_bridge_only`) into a sibling `compiler/ori_llvm/tests/aot/util/ir_inspect.rs`. The compile-and-run functions stay in `aot.rs`. Re-export from `util/mod.rs` (see the sketch below). This brings `aot.rs` to ~385 lines and `ir_inspect.rs` to ~185 lines.
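A sketch of what `util/mod.rs` could look like after the split; the module layout is only the proposal above, and the helper names are the ones listed there.

```rust
// compiler/ori_llvm/tests/aot/util/mod.rs (hypothetical post-split shape)
mod aot;        // compile-and-run harness: compile_and_run_capture(), etc.
mod ir_inspect; // IR inspection helpers extracted from aot.rs

// Re-export so existing `use util::...` paths in the ~1,950 tests keep compiling.
pub use aot::*;
pub use ir_inspect::{
    count_bridge_blocks, count_dead_phis, count_single_pred_phis,
    extract_function_ir, is_bridge_only, is_ssa_var_used_in,
};
```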
Test Strategy
- TDD ordering: For the `aot.rs` split (hygiene task), write a list of all public functions currently exported from `util/aot.rs`, then verify each is still accessible after extraction:
  - Before split: `cargo test -p ori_llvm --test aot --no-run` compiles successfully
  - After split: the same command compiles successfully, proving the re-exports are correct
  - All ~1,950 AOT tests produce identical results
- Matrix: All ~1,950 AOT tests produce identical results after each change in this subsection (temp dir reuse, leak detection opt-in, batch execution).
- Semantic pin: No test passes that previously failed, and no test fails that previously passed. If `ORI_CHECK_LEAKS` is made opt-in, a specific test must verify that `ORI_AOT_CHECK_LEAKS=1` still enables leak detection.
- Debug and release: `timeout 150 cargo t` (debug) AND `timeout 150 cargo t --release` (release) must pass after all changes.
- Measurement: Per-test overhead (ms/test) before and after. Target: reduce from ~18 ms/test to <8 ms/test.
- `/tpr-review` passed — independent review found no critical or major issues (or all findings triaged)
- `/impl-hygiene-review` passed — hygiene review clean. MUST run AFTER `/tpr-review` is clean.
- Subsection close-out (04.3) — MANDATORY before starting the next subsection. Run `/improve-tooling` retrospectively on THIS subsection's debugging journey (per `.claude/skills/improve-tooling/SKILL.md` "Per-Subsection Workflow"): which `diagnostics/` scripts you ran, where you added `dbg!`/`tracing` calls, where output was hard to interpret, where test failures gave unhelpful messages, where you ran the same command sequence repeatedly. Forward-look: what tool/log/diagnostic would shorten the next regression in this code path by 10 minutes? Implement improvements NOW (zero deferral) and commit each via a SEPARATE `/commit-push` using a valid conventional-commit type (`build(diagnostics): ... — surfaced by section-04.3 retrospective` — `build`/`test`/`chore`/`ci`/`docs` are valid; `tools(...)` is rejected by the lefthook commit-msg hook). Mandatory even when nothing felt painful. If genuinely no gaps, document briefly: "Retrospective 04.3: no tooling gaps". Update this subsection's `status` in the section frontmatter to `complete`.
- `/sync-claude` section-close doc sync — verify Claude artifacts across all section commits. Map changed crates to rules files, check CLAUDE.md, canon.md. Fix drift NOW.
- Repo hygiene check — run `diagnostics/repo-hygiene.sh --check` and clean any detected temp files.
04.R Third Party Review Findings
- None.
04.4 Completion Checklist
- Linker optimization evaluated and applied if beneficial (>10% improvement)
- Compilation pipeline optimizations applied based on per-phase profiling
- Process overhead reduced (temp files, leak detection, parallelism evaluated)
- Batch test execution prototype measured (10-test batch vs 10 individual runs)
- `compiler/ori_llvm/tests/aot/util/aot.rs` split: IR inspection helpers extracted to `ir_inspect.rs` (569 → ~385 lines)
- AOT test execution time measured: ??? (target: ≤15s)
- All ~1,950 AOT tests pass identically (no behavioral changes)
- Optimizations documented (what was changed, why, measured impact)
- `timeout 150 cargo t` passes with all tests green
- `/tpr-review` passed — independent Codex review found no critical or major issues (or all findings triaged)
- `/impl-hygiene-review` passed — implementation hygiene review clean (phase boundaries, SSOT, algorithmic DRY, naming). MUST run AFTER `/tpr-review` is clean.
- `/improve-tooling` retrospective completed — MANDATORY at section close, after both reviews are clean. Reflect on the section's debugging journey (which `diagnostics/` scripts you ran, which command sequences you repeated, where you added ad-hoc `dbg!`/`tracing` calls, where output was hard to interpret) and identify any tool/log/diagnostic improvement that would have made this section materially easier OR that would help the next section touching this area. Implement every accepted improvement NOW (zero deferral) and commit each via a SEPARATE `/commit-push`. The retrospective is mandatory even when nothing felt painful — that is exactly when blind spots accumulate. See `.claude/skills/improve-tooling/SKILL.md` "Retrospective Mode" for the full protocol.
Exit Criteria: ORI_TEST_TIMING=1 cargo test -p ori_llvm --test aot reports total time ≤15s. All ~1,950 tests pass. No test code was modified. The link-phase, compile-phase, and execute-phase timings are all recorded and show measurable improvement from baseline.