Section 03: Aggregate Value Emission
Status: Not Started
Goal: All fat pointer value copies use aggregate operations (2 instructions: load + store) instead of field-by-field decomposition (10 instructions: 3 GEP + 3 load + 3 insertvalue + 1 store). This applies to ALL fat pointer operations across the entire compiler, not just the journey scenarios.
Context: J16 discovered that every str operation (passing to functions, returning, binding) emits a 10-instruction field-by-field copy sequence. The ideal is 2 instructions: one aggregate load and one aggregate store. This bloat affects every program that uses strings, lists, maps, closures, or any other fat pointer type. J14 also found duplicate ptrtoint operations in the SSO guard sequence, and redundant unconditional branches in string function CFGs.
Reference implementations:
- LLVM docs/LangRef.rst: load {i64, i64, ptr}, ptr %src is a single instruction that loads the entire aggregate.
- Rust compiler/rustc_codegen_llvm/src/abi.rs: uses OperandValue::Immediate for small aggregates and Ref for large ones (aggregate loads for by-value struct passing).
03.1 Aggregate Load/Store for Fat Pointers
File(s): compiler/ori_llvm/src/codegen/arc_emitter/value_emission.rs, compiler/ori_llvm/src/codegen/arc_emitter/apply_helpers.rs
Currently, passing a str value emits:
```llvm
; ACTUAL: 10 instructions
%p0 = getelementptr inbounds {i64, i64, ptr}, ptr %src, i32 0, i32 0
%f0 = load i64, ptr %p0
%p1 = getelementptr inbounds {i64, i64, ptr}, ptr %src, i32 0, i32 1
%f1 = load i64, ptr %p1
%p2 = getelementptr inbounds {i64, i64, ptr}, ptr %src, i32 0, i32 2
%f2 = load ptr, ptr %p2
%v0 = insertvalue {i64, i64, ptr} undef, i64 %f0, 0
%v1 = insertvalue {i64, i64, ptr} %v0, i64 %f1, 1
%v2 = insertvalue {i64, i64, ptr} %v1, ptr %f2, 2
store {i64, i64, ptr} %v2, ptr %dst
```
The ideal:
```llvm
; IDEAL: 2 instructions
%v = load {i64, i64, ptr}, ptr %src
store {i64, i64, ptr} %v, ptr %dst
```
Note on JIT safety: The CLAUDE.md key rule says “never load %BigStruct, ptr for >16B in JIT — use per-field GEP+load+insert_value.” This applies to JIT (FastISel) mode only. For AOT compilation (which uses the full LLVM backend), aggregate loads are safe and preferred. The fix should gate on JIT vs AOT mode.
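A minimal Rust sketch of the mode gate follows. It assumes a toy emitter that returns IR as strings rather than calling the LLVM builder. CompilationMode mirrors the Jit/Aot split named in this section. The enum itself, Field, emit_copy, and the 16-byte constant (taken from the CLAUDE.md rule quoted above) are illustrative assumptions, not ori_llvm's API. Sizes are summed without padding, which is exact for {i64, i64, ptr} on a 64-bit target.

```rust
/// Hypothetical stand-in for the compiler's JIT/AOT mode.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum CompilationMode {
    Jit,
    Aot,
}

/// Assumed JIT limit from the CLAUDE.md rule: no aggregate loads over 16 bytes.
const JIT_MAX_AGGREGATE_LOAD_BYTES: u64 = 16;

/// One field of the aggregate: its LLVM type name and its size in bytes.
pub struct Field {
    pub ty: &'static str,
    pub size: u64,
}

/// Emits IR text that copies an aggregate of type `agg_ty` from %src to %dst.
pub fn emit_copy(mode: CompilationMode, agg_ty: &str, fields: &[Field]) -> Vec<String> {
    let size: u64 = fields.iter().map(|f| f.size).sum();
    let aggregate_ok = mode == CompilationMode::Aot || size <= JIT_MAX_AGGREGATE_LOAD_BYTES;
    if aggregate_ok {
        // One aggregate load plus one aggregate store.
        return vec![
            format!("%v = load {agg_ty}, ptr %src"),
            format!("store {agg_ty} %v, ptr %dst"),
        ];
    }
    // JIT fallback for large aggregates: per-field GEP + load...
    let mut ir = Vec::new();
    for (i, f) in fields.iter().enumerate() {
        ir.push(format!("%p{i} = getelementptr inbounds {agg_ty}, ptr %src, i32 0, i32 {i}"));
        ir.push(format!("%f{i} = load {ty}, ptr %p{i}", ty = f.ty));
    }
    // ...then rebuild the value with insertvalue and store it once.
    let mut prev = String::from("undef");
    for (i, f) in fields.iter().enumerate() {
        ir.push(format!("%v{i} = insertvalue {agg_ty} {prev}, {ty} %f{i}, {i}", ty = f.ty));
        prev = format!("%v{i}");
    }
    ir.push(format!("store {agg_ty} {prev}, ptr %dst"));
    ir
}
```

In this sketch the 24-byte fat pointer gets the 2-instruction IDEAL sequence under AOT and the 10-instruction per-field sequence under JIT. A 16-byte aggregate stays on the aggregate path in both modes.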
- Identify all callsites that emit field-by-field copy sequences: found in load_struct_selective() in memory.rs (NOT in value_emission.rs or apply_helpers.rs as originally suspected) (2026-03-18)
- Replace with aggregate load+store for AOT mode: load_struct_selective() now delegates to self.load(), which handles JIT/AOT mode correctly (2026-03-18)
- Apply to all fat pointer types (str, [T], closures, maps/sets): the fix is in load_struct_selective(), which is the shared path for all param loading (2026-03-18)
- Verify the fix applies when passing fat pointers as function arguments: confirmed via J16 IR, where @_ori_get_len and @_ori_longer use aggregate loads (2026-03-18)
- Measure instruction count reduction on J14 and J16: in J16, @_ori_get_len went from 13 to 5 instructions and @_ori_longer from 27 to 11 (2026-03-18)
- Implement direct pointer forwarding for borrowed parameters. When a function receives ptr readonly dereferenceable(24) and calls a runtime function that also takes ptr (e.g., ori_str_len), forward the parameter pointer directly instead of copying it to a local alloca. Implemented via the borrowed_param_ptrs map in ArcIrEmitter with Let-alias propagation. Applies to user-to-user calls (apply_param_passing_with_forwarding), user-to-runtime calls (aggregate coercion bypass), and the str.len() builtin (str_to_ptr_forwarded). Also fixed a pre-existing double-free in test_matrix_nested_list_two_calls (2026-03-18)
- Implement sret forwarding. When ori_str_from_raw writes to an sret alloca and the result is immediately stored to another sret ptr (e.g., @make_string), pass the final destination directly to ori_str_from_raw. Implemented via current_sret_ptr in ArcIrEmitter with take-semantics (the first call_with_sret consumes it). Also fixed string literal emission to route through the emitter’s call_with_sret instead of the builder’s. @_ori_make_string is now 3 instructions (was 4: alloca+call+load+store → call+load+store; the dead load/store is eliminated by LLVM DCE/DSE) (2026-03-18)
- Gate the JIT vs AOT mode check: already existed. IrBuilder has CompilationMode::Jit/Aot and load() already gates on it; the fix reuses this via self.load() delegation (2026-03-18)
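The two forwarding rules above can be sketched with a toy emitter. The names echo borrowed_param_ptrs, Let-alias propagation, current_sret_ptr, and call_with_sret from the checklist. The struct, the method signatures, the assumed i64 return type of ori_str_len, and the emitted IR strings are all hypothetical. Only the runtime-call path is modeled; user-to-user calls and aggregate coercion are not. The design point is Option::take(). The first sret call consumes the caller's destination, so a later sret call cannot overwrite it and falls back to a fresh alloca instead.

```rust
use std::collections::HashMap;

/// Hypothetical emitter tracking the two forwarding facts.
#[derive(Default)]
pub struct Emitter {
    /// Variable name -> incoming `ptr readonly` parameter it aliases (e.g. "%0").
    borrowed_param_ptrs: HashMap<String, String>,
    /// Caller-provided sret destination; consumed by the first sret call.
    current_sret_ptr: Option<String>,
    pub ir: Vec<String>,
    next_tmp: usize,
}

impl Emitter {
    pub fn new(sret: Option<&str>) -> Self {
        Emitter { current_sret_ptr: sret.map(String::from), ..Default::default() }
    }

    /// Records that `var` is bound directly to the borrowed parameter `param_ptr`.
    pub fn bind_borrowed_param(&mut self, var: &str, param_ptr: &str) {
        self.borrowed_param_ptrs.insert(var.to_string(), param_ptr.to_string());
    }

    /// `let alias = src`: the alias inherits src's forwarded pointer, if any.
    pub fn let_alias(&mut self, alias: &str, src: &str) {
        let target = self.borrowed_param_ptrs.get(src).cloned();
        if let Some(p) = target {
            self.borrowed_param_ptrs.insert(alias.to_string(), p);
        }
    }

    fn fresh(&mut self, prefix: &str) -> String {
        self.next_tmp += 1;
        format!("%{prefix}{n}", n = self.next_tmp)
    }

    /// Passes `var` to a runtime function that takes `ptr`.
    pub fn call_runtime_with_ptr(&mut self, callee: &str, var: &str) {
        let forwarded = self.borrowed_param_ptrs.get(var).cloned();
        let arg = match forwarded {
            // Forward the caller's pointer: no alloca, no copy.
            Some(p) => p,
            // Otherwise spill the value to a local alloca and pass that.
            None => {
                let slot = self.fresh("tmp");
                self.ir.push(format!("{slot} = alloca {{i64, i64, ptr}}"));
                self.ir.push(format!("store {{i64, i64, ptr}} %{var}, ptr {slot}"));
                slot
            }
        };
        self.ir.push(format!("call i64 @{callee}(ptr {arg})"));
    }

    /// Calls a function that returns through an sret pointer; returns that pointer.
    pub fn call_with_sret(&mut self, callee: &str) -> String {
        let dst = match self.current_sret_ptr.take() {
            Some(p) => p, // write straight into the caller's destination
            None => {
                let slot = self.fresh("sret");
                self.ir.push(format!("{slot} = alloca {{i64, i64, ptr}}"));
                slot
            }
        };
        self.ir.push(format!("call void @{callee}(ptr {dst})"));
        dst
    }
}
```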
03.2 Deduplicate ptrtoint in SSO Guard
File(s): compiler/ori_llvm/src/codegen/arc_emitter/rc_buffer_ops.rs
J14 found that each SSO guard (the bit 63 check for inline strings) converts the same pointer to integer twice:
```llvm
; ACTUAL: 2 conversions for the same pointer
%rc_dec.p2i = ptrtoint ptr %data to i64 ; first conversion
%rc_dec.sso = and i64 %rc_dec.p2i, -9223372036854775808
%rc_dec.is_sso = icmp ne i64 %rc_dec.sso, 0
...
%rc_dec.null.p2i = ptrtoint ptr %data to i64 ; DUPLICATE
%rc_dec.is_null = icmp eq i64 %rc_dec.null.p2i, 0
```
The ideal: a single ptrtoint whose result is reused for both the SSO check and the null check.
Root cause: emit_sso_check calls ptr_to_int at line ~267, then calls is_null_ptr at line ~279, which internally calls ptr_to_int again via comparisons.rs:102. The fix is to reuse the first ptr_int value for the null check via icmp eq i64 %ptr_int, 0.
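A plain-Rust model of the deduplicated guard follows. It assumes, as in the IR above, that the SSO test runs before the null test, and it assumes a 64-bit target. emit_sso_guard and classify are hypothetical helpers, not the rc_buffer_ops.rs functions. The point is that both tests read one integer value, so one ptrtoint suffices. The mask -9223372036854775808 is i64::MIN, i.e. only bit 63 set.

```rust
/// Bit 63 marks an inline (SSO) string; equals i64::MIN reinterpreted as u64.
pub const SSO_FLAG: u64 = 1 << 63;

#[derive(Debug, PartialEq)]
pub enum Guard {
    Sso,
    Null,
    Heap,
}

/// Guard semantics: one pointer-to-integer conversion feeds both tests.
pub fn classify(data: *const u8) -> Guard {
    let ptr_int = data as usize as u64; // the single "ptrtoint"
    if (ptr_int & SSO_FLAG) != 0 {
        Guard::Sso
    } else if ptr_int == 0 {
        Guard::Null // reuses ptr_int; no second conversion
    } else {
        Guard::Heap
    }
}

/// Emits the guard IR with one ptrtoint whose result feeds both icmps.
pub fn emit_sso_guard(data: &str) -> Vec<String> {
    vec![
        format!("%rc_dec.p2i = ptrtoint ptr {data} to i64"),
        format!("%rc_dec.sso = and i64 %rc_dec.p2i, {}", i64::MIN),
        "%rc_dec.is_sso = icmp ne i64 %rc_dec.sso, 0".to_string(),
        "%rc_dec.is_null = icmp eq i64 %rc_dec.p2i, 0".to_string(),
    ]
}
```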
- Modify emit_sso_check in rc_buffer_ops.rs to reuse the ptrtoint result: already implemented. A single ptr_to_int is reused for both the SSO flag and the null check, with the comment “Reuse ptr_int for null check (avoids duplicate ptrtoint)” (pre-existing fix, verified 2026-03-18)
- Verify the fix applies to all fat pointer RC operations: confirmed via J14 IR, with 0 duplicate ptrtoint across all SSO guard sites (2026-03-18)
03.3 Single-Predecessor Block Merging for SSO Paths
File(s): compiler/ori_llvm/src/codegen/ir_builder/cfg_simplify/mod.rs
J14 found redundant unconditional branches (br label %bb1 at end of bb0) in @sso_len and @heap_len. Block bb1 has a single predecessor (bb0), so the two blocks should be merged into one. This is a block merging issue (single-predecessor successor), not an empty block issue. The existing cfg_simplify pass performs entry block merging (added in commit d2c9a929) but may not handle the general single-predecessor case.
- Verify the CFG simplification pass runs after SSO guard emission: confirmed. simplify_cfg() runs at function verification time, after all emission (2026-03-18)
- Check whether merge_entry_blocks() handles the general single-predecessor case: confirmed it only handled entry blocks; the general case was missing (2026-03-18)
- Confirmed bb1 was not an entry block: merge_entry_block() only checked blocks[0] (2026-03-18)
- Implement general single-predecessor successor merging: added merge_single_predecessor_blocks() with fixed-point iteration in cfg_simplify/mod.rs. It uses LLVMInstructionRemoveFromParent + LLVMInsertIntoBuilder to move instructions, then updates phi incoming blocks (2026-03-18)
- Verify no redundant unconditional branches remain: confirmed via J16 IR, where @_ori_check_pass and @_ori_check_return have bb0→bb1 merged (2026-03-18)
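Below is a self-contained sketch of fixed-point single-predecessor merging on a toy CFG. It uses block indices and strings instead of LLVM values, and std::collections::HashMap rather than FxHashMap so that it has no dependencies. As simplifying assumptions, it never absorbs block 0 (the entry), and it skips successors that carry their own phi nodes (a real pass could fold those single-entry phis). After each merge it rewrites phi incoming blocks in the absorbed block's successors, mirroring the phi update step described above. Block, Term, and this merge_single_predecessor_blocks are illustrative, not the cfg_simplify code.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
pub enum Term {
    Br(usize),
    CondBr(usize, usize),
    Ret,
}

#[derive(Clone, Debug)]
pub struct Block {
    pub insts: Vec<String>,
    /// (predecessor block, value) pairs of this block's phi node, if any.
    pub phi_incoming: Vec<(usize, String)>,
    pub term: Term,
    pub removed: bool,
}

impl Block {
    pub fn new(insts: &[&str], term: Term) -> Self {
        Block {
            insts: insts.iter().map(|s| s.to_string()).collect(),
            phi_incoming: Vec::new(),
            term,
            removed: false,
        }
    }
}

fn successors(term: &Term) -> Vec<usize> {
    match term {
        Term::Br(b) => vec![*b],
        Term::CondBr(t, f) => vec![*t, *f],
        Term::Ret => vec![],
    }
}

/// Counts CFG edges into each block, ignoring removed blocks.
fn predecessor_counts(blocks: &[Block]) -> HashMap<usize, usize> {
    let mut counts = HashMap::new();
    for block in blocks.iter().filter(|b| !b.removed) {
        for s in successors(&block.term) {
            *counts.entry(s).or_insert(0) += 1;
        }
    }
    counts
}

/// Folds block `b` into `a` whenever `a` ends in `br label %b` and `a` is
/// b's only predecessor. Repeats to a fixed point; returns the merge count.
pub fn merge_single_predecessor_blocks(blocks: &mut [Block]) -> usize {
    let mut merges = 0;
    loop {
        let preds = predecessor_counts(blocks);
        let candidate = (0..blocks.len()).find_map(|a| match blocks[a].term {
            Term::Br(b)
                if !blocks[a].removed
                    && b != a
                    && b != 0
                    && preds.get(&b) == Some(&1)
                    && blocks[b].phi_incoming.is_empty() =>
            {
                Some((a, b))
            }
            _ => None,
        });
        let Some((a, b)) = candidate else { return merges };
        // Move b's body and terminator onto the end of a (drops the br).
        let moved = std::mem::take(&mut blocks[b].insts);
        let term = blocks[b].term.clone();
        blocks[a].insts.extend(moved);
        blocks[a].term = term.clone();
        blocks[b].removed = true;
        // b's successors now receive control from a: rewrite their phis.
        for s in successors(&term) {
            for (pred, _) in blocks[s].phi_incoming.iter_mut() {
                if *pred == b {
                    *pred = a;
                }
            }
        }
        merges += 1;
    }
}
```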
Cleanup
- [WASTE] cfg_simplify/mod.rs: replaced std::collections::HashMap with rustc_hash::FxHashMap (2026-03-18)
03.4 Dead Unwind Elimination for nounwind Callees
File(s): compiler/ori_llvm/src/codegen/arc_emitter/dead_unwind.rs, compiler/ori_llvm/src/codegen/arc_emitter/terminators.rs
J16 found that @check_pass calls @_ori_get_len (which is nounwind) with invoke instead of call, generating ~12 instructions of dead landing-pad code. The same pattern appears in @check_multi’s invoke of @_ori_longer.
Codebase note: terminators.rs:230 already implements InvokeMode::Call when is_nounwind is true. The issue is likely that the callee is not in ctx.nounwind_functions — the nounwind analysis may not detect user-defined Ori functions as nounwind (it may only cover runtime functions). Check ctx.nounwind_functions population in codegen/function_compiler/ or the ARC pipeline.
- Verify dead_unwind.rs runs after nounwind analysis: confirmed. detect_dead_unwind_blocks checks ctx.nounwind_functions, which is populated by compute_nounwind_set before emission (2026-03-18)
- Determine why user-defined Ori functions are not in the nounwind set. ROOT CAUSE: is_arc_function_nounwind() in define_phase.rs didn’t recognize builtin method calls (e.g. Apply @length) as nounwind; is_rt_fn_nounwind("length") returns None, and length is not in nounwind_functions. Fixed by adding an is_callee_intercepted() check that mirrors the callee_will_be_intercepted logic: format calls, prelude functions, and builtin methods on builtin types are all intercepted, so they always emit call and are effectively nounwind (2026-03-18)
- Fix the invoke emission path: emit_invoke in terminators.rs was already correct; the fix was in the pre-analysis is_arc_function_nounwind(). get_len is now correctly identified as nounwind during the two-pass analysis, so check_pass’s invoke of get_len is downgraded to call (2026-03-18)
- Test: @check_pass uses call (not invoke) to call @_ori_get_len. Confirmed via J16 IR + regression test test_nounwind_callee_uses_call (2026-03-18)
- Verify no dead landing pads for nounwind callees: confirmed. @_ori_check_pass has no personality, no landingpad, and no resume (2026-03-18)
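The root-cause fix can be modeled as a fixed-point nounwind analysis. This is a generalization of the two-pass analysis described above, not is_arc_function_nounwind() itself. Callee, nounwind_fixed_point, call_instruction, and the runtime symbol rt_may_unwind used in the usage check are hypothetical. Builtin methods count as nounwind because they are intercepted and always lowered to call, which is the is_callee_intercepted() rule. The set starts empty, so recursive functions are conservatively left as may-unwind.

```rust
use std::collections::{HashMap, HashSet};

/// What a call site targets (a simplified, hypothetical classification).
pub enum Callee {
    /// A user-defined Ori function, by name.
    User(&'static str),
    /// A runtime function, by symbol name.
    Runtime(&'static str),
    /// A builtin method on a builtin type (e.g. `length`): intercepted by
    /// codegen and always emitted as `call`, so treated as nounwind.
    BuiltinMethod(&'static str),
}

/// A function is nounwind when every callee is nounwind. Iterates until
/// no new function can be added to the set.
pub fn nounwind_fixed_point(
    bodies: &HashMap<&'static str, Vec<Callee>>,
    runtime_nounwind: &HashSet<&'static str>,
) -> HashSet<&'static str> {
    let mut set: HashSet<&'static str> = HashSet::new();
    loop {
        let mut changed = false;
        for (&name, callees) in bodies {
            if set.contains(name) {
                continue;
            }
            let all_nounwind = callees.iter().all(|c| match c {
                Callee::User(f) => set.contains(f),
                Callee::Runtime(f) => runtime_nounwind.contains(f),
                Callee::BuiltinMethod(_) => true, // the is_callee_intercepted() case
            });
            if all_nounwind {
                set.insert(name);
                changed = true;
            }
        }
        if !changed {
            return set;
        }
    }
}

/// Emission choice: nounwind callees get `call`, everything else `invoke`.
pub fn call_instruction(callee: &str, nounwind: &HashSet<&'static str>) -> &'static str {
    if nounwind.contains(callee) {
        "call"
    } else {
        "invoke"
    }
}
```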
03.R Third Party Review Findings
- None.
03.N Completion Checklist
- str passing uses aggregate load/store (1 load instruction, not 9 GEP+load+insertvalue): verified via J16 @_ori_get_len IR (2026-03-18)
- [T] passing uses aggregate load/store: same codepath as str (load_struct_selective → load) (2026-03-18)
- Closure passing uses aggregate load/store: same codepath (2026-03-18)
- Borrowed parameter forwarding: @get_len(ptr readonly) forwards ptr directly to ori_str_len(ptr) without copying to a local alloca. Verified: @_ori_str_len_wrapper and @_ori_check_pass forward %0 directly (2026-03-18)
- Sret forwarding: @make_string passes the sret ptr directly to ori_str_from_raw without an intermediate alloca. Verified via J16 IR: call void @ori_str_from_raw(ptr %0, ...) (2026-03-18)
- SSO guard emits a single ptrtoint per guard (not duplicate): pre-existing fix, verified via regression test test_sso_guard_single_ptrtoint (2026-03-18)
- No redundant unconditional branches between single-predecessor blocks: verified via J16 @_ori_check_pass and @_ori_check_return IR + regression test test_single_predecessor_block_merged (2026-03-18)
- JIT mode still works (field-by-field fallback): the JIT guard in IrBuilder::load() at line 74 is unchanged, and load_struct_selective delegates to load(), which checks the mode (2026-03-18)
- No dead landing pads for nounwind callees (J16 LOW-2): fixed via is_callee_intercepted() in the nounwind pre-analysis + regression test test_nounwind_callee_uses_call (2026-03-18)
- ./test-all.sh green (debug): 12,972 tests pass, 0 failures (2026-03-18)
- ./clippy-all.sh green (2026-03-18)
- J14 re-run, rescored with extract-metrics.py: 10.0/10 (up from 9.4). Instruction efficiency OPTIMAL (1.00x), 0 CF defects, 0 unjustified instructions, @_ori_shared_len now 3 instructions (2026-03-18)
- J16 re-run, rescored with extract-metrics.py: 9.9/10 (up from 9.4). Instruction efficiency OPTIMAL (1.00x), 0 CF defects, 0 unjustified instructions, 33/34 attributes (97.1%). The remaining 0.1 is due to the sret intermediate copy in @_ori_make_string (2026-03-18)
Exit Criteria: python3 .claude/skills/code-journey/extract-metrics.py on J14 and J16 IR reports 0 unjustified instructions AND 0 CF defects AND ./test-all.sh passes in both debug and release.