Section 09: ARC Header Compression
Context: The current runtime (ori_rt) unconditionally uses i64 for the reference count, with MAX_REFCOUNT = i64::MAX. The ARC pipeline already computes borrow/ownership information, so specializing RC count handling is a plausible next step. This section now explicitly commits to the smaller-header path: changing only the width of strong_count is insufficient unless the surrounding V5 layout is redesigned so bounded-count variants actually occupy fewer than 32 bytes.
Warning (V5 header, 2026-03-22): The RC header is now 32 bytes with 4 fields: `data_size` (i64), `elem_dec_fn` (ptr), `elem_count` (i64), `strong_count` (i64). This plan was drafted when the header had only 2 fields (`data_size` + `strong_count`). Any narrowing strategy must account for all 4 fields, not just the refcount. The `rc_ops!` macro code in Section 09.3 does not account for `elem_dec_fn` or `elem_count` and must be updated before implementation.
The challenge: RC header width must be a compile-time decision, but refcount values are runtime quantities. We need static analysis that proves an upper bound on the dynamic refcount.
Reference implementations:
- Swift `stdlib/public/SwiftShims/RefCount.h`: Encodes refcount + flags in a single 64-bit word using bitfields. Stores strong count, unowned count, and flags (immutable, immortal, deallocating) in one word.
- Lean4 `src/runtime/object.h`: Uses 32-bit RC + tag in a single word. RC overflow bumps to "immortal" (never freed).
- CPython: Uses `Py_ssize_t` (platform word size) — simple but wastes memory.
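Swift's single-word scheme can be sketched roughly like this (the field widths, constant names, and saturation rule below are illustrative assumptions, not Swift's actual encoding):

```rust
// Hypothetical Swift-style packing: strong count in the low bits of one
// 64-bit word, flags in the high bits. Field widths are illustrative.
const STRONG_BITS: u32 = 30;
const STRONG_MASK: u64 = (1 << STRONG_BITS) - 1;
const IMMORTAL_FLAG: u64 = 1 << 63;

fn strong_count(word: u64) -> u64 {
    word & STRONG_MASK
}

// Increment saturates into "immortal" instead of overflowing into the
// neighboring bitfields (the Lean4-style escape hatch).
fn inc_strong(word: u64) -> u64 {
    if word & IMMORTAL_FLAG != 0 || strong_count(word) == STRONG_MASK {
        word | IMMORTAL_FLAG
    } else {
        word + 1
    }
}
```

The appeal for this section is that flags and count share one word, so no extra header field is needed for the immortal state.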
Depends on: §02 (triviality — trivial values need no header), §08 (escape analysis — non-escaping values need no header).
09.1 Sharing Bound Analysis
File(s): compiler/ori_repr/src/arc_opt/sharing.rs
Compute an upper bound on the number of simultaneous references to each allocation.
- Define sharing bound:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SharingBound {
    /// Exactly one reference at all times (unique ownership)
    Unique,
    /// At most N references (from call graph analysis)
    Bounded(u32),
    /// Unknown — could be arbitrarily many
    Unbounded,
}
```

- Implement per-allocation sharing analysis:

```rust
pub fn compute_sharing_bound(
    alloc: AllocId,
    arc_func: &ArcFunction,
    escape_info: &EscapeInfo,
    loop_info: &LoopInfo,
) -> SharingBound {
    // If value doesn't escape → Unique (refcount always 1)
    if escape_info.escape_state(alloc) == EscapeState::NoEscape {
        return SharingBound::Unique;
    }

    // Collect all rc_inc instructions for this allocation
    let incs = arc_func.rc_incs_for(alloc);
    let decs = arc_func.rc_decs_for(alloc);

    // SOUNDNESS: Static instruction count ≠ dynamic execution count.
    // A single rc_inc inside a loop body can execute N times, creating
    // up to N simultaneous references. We must bail to Unbounded if
    // any rc_inc is inside a loop or reachable via recursion.
    let any_inc_in_loop = incs.iter().any(|inc| loop_info.is_in_loop(inc.block));
    if any_inc_in_loop {
        // Future: if §03 range analysis can bound the loop iteration
        // count, we could use SharingBound::Bounded(inc_count * max_iters + 1).
        // For now, conservatively bail.
        return SharingBound::Unbounded;
    }

    // Similarly, if this function is (mutually) recursive and the
    // allocation flows into the recursive call, the static count
    // is not a sound bound.
    if arc_func.is_recursive() && arc_func.alloc_flows_to_self_call(alloc) {
        return SharingBound::Unbounded;
    }

    // SOUNDNESS: GlobalEscape means outside code can see this value
    // (returned, stored in global, captured in escaping closure) and may
    // create arbitrarily many additional references. Only ArgEscape
    // (borrowed by callees, never retained) allows a bounded count.
    if escape_info.escape_state(alloc) == EscapeState::GlobalEscape {
        return SharingBound::Unbounded;
    }

    // ArgEscape: value is passed to callees as borrowed only.
    // All rc_inc/dec are in straight-line code (no loops, no recursion).
    // If all inc/dec pairs are within the function → bounded.
    debug_assert_eq!(
        escape_info.escape_state(alloc),
        EscapeState::ArgEscape,
        "only ArgEscape should reach here"
    );
    let inc_count = incs.len();
    let dec_count = decs.len();
    if inc_count == dec_count {
        // All sharing is temporary → max refs = inc_count + 1 (initial)
        return SharingBound::Bounded(inc_count as u32 + 1);
    }

    // Callee retains some references we can't bound
    SharingBound::Unbounded
}
```

- Interprocedural refinement:
  - If a function's escape summary says a parameter is `ArgEscape` (borrowed), it contributes 0 to the sharing count
  - If a value is passed to N functions as borrowed, sharing bound is still 1
  - If a value is cloned into a collection, sharing bound increases by collection capacity
- `/tpr-review` passed — independent review found no critical or major issues (or all findings triaged)
- `/impl-hygiene-review` passed — hygiene review clean. MUST run AFTER `/tpr-review` is clean.
- Subsection close-out (09.1) — MANDATORY before starting the next subsection. Run `/improve-tooling` retrospectively on THIS subsection's debugging journey (per `.claude/skills/improve-tooling/SKILL.md` "Per-Subsection Workflow"): which `diagnostics/` scripts you ran, where you added `dbg!`/`tracing` calls, where output was hard to interpret, where test failures gave unhelpful messages, where you ran the same command sequence repeatedly. Forward-look: what tool/log/diagnostic would shorten the next regression in this code path by 10 minutes? Implement improvements NOW (zero deferral) and commit each via a SEPARATE `/commit-push` using a valid conventional-commit type (`build(diagnostics): ...` — surfaced by section-09.1 retrospective; `build`/`test`/`chore`/`ci`/`docs` are valid, `tools(...)` is rejected by the lefthook commit-msg hook). Mandatory even when nothing felt painful. If genuinely no gaps, document briefly: "Retrospective 09.1: no tooling gaps". Update this subsection's `status` in section frontmatter to `complete`.
- `/sync-claude` section-close doc sync — verify Claude artifacts across all section commits. Map changed crates to rules files, check CLAUDE.md, canon.md. Fix drift NOW.
- Repo hygiene check — run `diagnostics/repo-hygiene.sh --check` and clean any detected temp files.
09.2 RC Width Selection
File(s): compiler/ori_repr/src/arc_opt/rc_width.rs
Map sharing bounds to RC header widths.
- Implement width selection:

```rust
pub fn select_rc_width(bound: SharingBound) -> Option<IntWidth> {
    match bound {
        SharingBound::Unique => None, // no RC header at all!
        SharingBound::Bounded(n) if n <= 127 => Some(IntWidth::I8),
        SharingBound::Bounded(n) if n <= 32_767 => Some(IntWidth::I16),
        SharingBound::Bounded(n) if n <= 2_147_483_647 => Some(IntWidth::I32),
        _ => Some(IntWidth::I64),
    }
}
```

- Overflow behavior:
- If at runtime the refcount exceeds the header width → must NOT silently overflow
- Options:
- (a) Trap (recommended for debug builds): panic on overflow
- (b) Promote to immortal (Lean4 approach): set refcount to MAX → never freed (leaked)
- (c) Widen header (complex): realloc with wider header at runtime
- Recommendation: (a) for debug, (b) for release. The analysis proving the bound is sound, so overflow should never happen in correct programs. The trap catches bugs; the immortal fallback prevents crashes.
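A hedged sketch of that hybrid policy on a hypothetical i8-wide count (the sentinel and function names are illustrative, not the real `ori_rt` symbols):

```rust
/// Top value doubles as the "immortal" sentinel: once a count reaches it,
/// the object is never freed (Lean4-style leak instead of corruption).
const IMMORTAL_I8: i8 = i8::MAX;

fn rc_inc_i8(count: &mut i8) {
    if *count == IMMORTAL_I8 {
        return; // frozen: immortal objects ignore inc/dec
    }
    *count += 1;
    // The sharing-bound analysis promised we never get here: trap in
    // debug builds, silently stay immortal in release builds.
    debug_assert!(*count != IMMORTAL_I8, "refcount overflow: sharing bound was unsound");
}

fn rc_dec_i8(count: &mut i8) -> bool {
    if *count == IMMORTAL_I8 {
        return false; // never freed
    }
    *count -= 1;
    *count == 0 // true → caller runs drop + free
}
```

Note that the sentinel eats the top value, so under this scheme the §09.2 thresholds would shift down by one (e.g. ≤126 for i8).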
- Generate per-width runtime functions:

```rust
// ori_rt additions for narrow refcount headers.
//
// CRITICAL: The current V5 header is 32 bytes with 4 fields:
//   data_size (i64), elem_dec_fn (ptr), elem_count (i64), strong_count (i64)
//
// Narrowing ONLY applies to strong_count. The other 3 fields
// (data_size, elem_dec_fn, elem_count) remain at their canonical widths.
//
// IMPORTANT: If the header remains padded/aligned to 32 bytes, these
// width-specific variants are a throughput/verification specialization,
// NOT a memory-footprint optimization. Do not claim per-allocation byte
// savings unless §09.3 first adopts a layout that is actually smaller
// than the current 32-byte V5 header.
extern "C" fn ori_rc_alloc_i8(size: usize, align: usize) -> *mut u8;
extern "C" fn ori_rc_inc_i8(data_ptr: *mut u8);
extern "C" fn ori_rc_dec_i8(data_ptr: *mut u8, drop_fn: Option<...>);
extern "C" fn ori_rc_free_i8(data_ptr: *mut u8, size: usize, align: usize);
// Similar for i16, i32.
// drop_fn calls ori_rc_free_$suffix (generated by DropFunctionGenerator).
```

- `/tpr-review` passed — independent review found no critical or major issues (or all findings triaged)
- `/impl-hygiene-review` passed — hygiene review clean. MUST run AFTER `/tpr-review` is clean.
- Subsection close-out (09.2) — MANDATORY before starting the next subsection. Run `/improve-tooling` retrospectively on THIS subsection's debugging journey (per `.claude/skills/improve-tooling/SKILL.md` "Per-Subsection Workflow"): which `diagnostics/` scripts you ran, where you added `dbg!`/`tracing` calls, where output was hard to interpret, where test failures gave unhelpful messages, where you ran the same command sequence repeatedly. Forward-look: what tool/log/diagnostic would shorten the next regression in this code path by 10 minutes? Implement improvements NOW (zero deferral) and commit each via a SEPARATE `/commit-push` using a valid conventional-commit type (`build(diagnostics): ...` — surfaced by section-09.2 retrospective; `build`/`test`/`chore`/`ci`/`docs` are valid, `tools(...)` is rejected by the lefthook commit-msg hook). Mandatory even when nothing felt painful. If genuinely no gaps, document briefly: "Retrospective 09.2: no tooling gaps". Update this subsection's `status` in section frontmatter to `complete`.
- `/sync-claude` section-close doc sync — verify Claude artifacts across all section commits. Map changed crates to rules files, check CLAUDE.md, canon.md. Fix drift NOW.
- Repo hygiene check — run `diagnostics/repo-hygiene.sh --check` and clean any detected temp files.
09.3 Runtime Polymorphism
File(s): compiler/ori_rt/src/rc/narrow.rs (new file inside rc/ module)
The runtime must support multiple header widths without code bloat.
Module placement: The width-specific functions MUST live inside `rc/` (e.g., `rc/narrow.rs` with `mod narrow;` in `rc/mod.rs`). This is required because they call `call_drop_fn` and `rc_underflow_abort`, which are `pub(super)` — visible within `rc/` but not from `lib.rs` or other modules. Tests go in `rc/narrow/tests.rs` (sibling convention) if the file becomes a directory module, or in `rc/tests.rs` if `narrow.rs` stays as a leaf file and tests are co-located with the existing `rc/` test module.
Risk warning: The macro-generated RC operations below use raw pointer arithmetic and `unsafe`. Every `unsafe` block MUST have a `// SAFETY:` comment. The padded-header alignment logic is subtle — a bug causes data corruption in EVERY narrow-header allocation. Property-based testing with varying (size, align) pairs is essential. Note: `ori_rt` is a crate where `unsafe` IS allowed, so `#![deny(unsafe_code)]` does NOT apply here.
- Document the committed V5 layout strategy BEFORE writing any code:
  - This plan chooses Option A — true memory optimization.
  - Redesign the header so bounded-count variants are actually smaller than 32 bytes (for example via variable offsets or a packed/bitfield strategy).
  - Update `elem_header.rs` accessors and every caller that currently assumes fixed offsets.
  - The previously proposed "narrow count plus padding back to 32 bytes" layout is NOT an acceptable final design for this section.
  - Add compile-time layout assertions in `rc/narrow.rs` proving the bounded variants are smaller than the current 32-byte V5 header while preserving payload alignment.
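Those assertions might look like the following, assuming hypothetical packed struct definitions (`V5HeaderI32`, `V5HeaderI8`) standing in for whatever layout §09.3 commits to; `repr(packed)` is used here only to make the size point concrete, and any field access on such structs would need `read_unaligned`:

```rust
use std::mem::size_of;

// Hypothetical narrow layouts: strong_count is the only narrowed field.
// packed(4)/packed(1) are stand-ins for the real committed layout.
#[repr(C, packed(4))]
struct V5HeaderI32 {
    data_size: i64,
    elem_dec_fn: *const (),
    elem_count: i64,
    strong_count: i32,
}

#[repr(C, packed(1))]
struct V5HeaderI8 {
    data_size: i64,
    elem_dec_fn: *const (),
    elem_count: i64,
    strong_count: i8,
}

// Compile-time proofs that the bounded variants genuinely shrink below
// the 32-byte V5 header; these fail the build if a layout change
// silently pads back up.
const _: () = assert!(size_of::<V5HeaderI32>() < 32); // 28 on 64-bit
const _: () = assert!(size_of::<V5HeaderI8>() < 32); // 25 on 64-bit
```

The key property is that the assertion is on the struct the runtime actually uses, so padding regressions are caught at compile time rather than in Valgrind.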
- Implement narrow RC operations for the V5 header layout:

  CRITICAL DESIGN ISSUE: The `rc_ops!` macro from the original plan assumes a simple single-field header where the refcount is immediately before the payload (`data_ptr.sub(1)`). This does NOT match the current V5 header layout:

```
V5 Header (32 bytes):
┌──────────────┬──────────────┬──────────────┬──────────────┐
│  data_size   │ elem_dec_fn  │  elem_count  │ strong_count │
│    (i64)     │    (ptr)     │    (i64)     │    (i64)     │
└──────────────┴──────────────┴──────────────┴──────────────┘
                                             ↑ this field narrows
```

  The narrow-header approach must:
  - Keep `data_size`, `elem_dec_fn`, `elem_count` at their current widths (they are semantically different from refcount)
  - Only narrow `strong_count` (the last field before payload)
  - The `rc_inc`/`rc_dec` functions locate `strong_count` at a fixed negative offset from `data_ptr` — this offset changes when `strong_count` is narrowed
  - The V5 header accessor functions in `ori_rt/src/rc/elem_header.rs` (`store_elem_dec_fn`, `load_elem_dec_fn`, etc.) use hardcoded offsets that must be updated or parameterized

  Implementation approach:
  - Define the chosen header struct(s) and update all offset accessors consistently for the new smaller layouts.
  - Generate `ori_rc_alloc_i32`, `ori_rc_inc_i32`, `ori_rc_dec_i32` that use the narrow header struct
  - Update `DropFunctionGenerator` in `ori_llvm` to emit calls to width-specific free functions
  - Add only the padding required to maintain payload alignment; do not re-expand the bounded variants back to 32 bytes

```rust
// Simplified sketch — the real implementation must handle V5 header fields
// and prove the resulting layouts are genuinely smaller than 32 bytes.
rc_narrow_ops!(i8, i8, i8::MAX);
rc_narrow_ops!(i16, i16, i16::MAX);
rc_narrow_ops!(i32, i32, i32::MAX);
// i64 remains the existing 32-byte V5 implementation.
```
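As an illustrative sketch of what one such instantiation might expand to (the packed 28-byte `HDR_I32`, the `hdr_offset` rounding, and all names are assumptions, not the committed layout; null checks and the mandatory `// SAFETY:` comments are elided for brevity):

```rust
use std::alloc::{alloc, dealloc, Layout};

// Hypothetical packed i32-header size: data_size(i64) + elem_dec_fn(ptr)
// + elem_count(i64) + strong_count(i32) = 28 bytes on 64-bit targets.
const HDR_I32: usize = 28;

// Bytes reserved in front of the payload, rounded up so the payload keeps
// its required alignment. NOTE: for align = 8 this rounds back to 32,
// which is exactly why real savings need the field-level redesign, not
// just a narrower count.
fn hdr_offset(align: usize) -> usize {
    HDR_I32.next_multiple_of(align.max(4))
}

unsafe fn ori_rc_alloc_i32(size: usize, align: usize) -> *mut u8 {
    let off = hdr_offset(align);
    let layout = Layout::from_size_align(off + size, align.max(4)).unwrap();
    let data_ptr = alloc(layout).add(off); // null check elided in sketch
    // strong_count occupies the last 4 bytes before the payload.
    (data_ptr.sub(4) as *mut i32).write_unaligned(1);
    data_ptr
}

unsafe fn ori_rc_inc_i32(data_ptr: *mut u8) {
    let p = data_ptr.sub(4) as *mut i32;
    p.write_unaligned(p.read_unaligned() + 1);
}

// Returns true when the count hits zero (caller runs the drop/free path).
unsafe fn ori_rc_dec_i32(data_ptr: *mut u8) -> bool {
    let p = data_ptr.sub(4) as *mut i32;
    let n = p.read_unaligned() - 1;
    p.write_unaligned(n);
    n == 0
}

unsafe fn ori_rc_free_i32(data_ptr: *mut u8, size: usize, align: usize) {
    let off = hdr_offset(align);
    dealloc(
        data_ptr.sub(off),
        Layout::from_size_align(off + size, align.max(4)).unwrap(),
    );
}
```

The `drop_fn` plumbing and underflow checks from the existing i64 path are omitted; the point is only the offset discipline that alloc and free must share.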
- Atomic variants:
- For thread-shared values (§10 determines this), use atomic operations
- For thread-local values, use plain loads/stores (much faster)
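A minimal sketch of that split, assuming a hypothetical i32-wide count and that §10's thread-sharing analysis picks the variant at compile time (function names are illustrative):

```rust
use std::sync::atomic::{fence, AtomicI32, Ordering};

// Thread-shared: atomic RMW. Relaxed suffices for inc (you already hold a
// reference); dec uses Release, with an Acquire fence before destruction.
fn rc_inc_atomic(count: &AtomicI32) {
    count.fetch_add(1, Ordering::Relaxed);
}

fn rc_dec_atomic(count: &AtomicI32) -> bool {
    if count.fetch_sub(1, Ordering::Release) == 1 {
        fence(Ordering::Acquire);
        return true; // last reference: caller drops + frees
    }
    false
}

// Thread-local: plain loads/stores, no lock prefix, much cheaper.
fn rc_inc_local(count: &mut i32) {
    *count += 1;
}

fn rc_dec_local(count: &mut i32) -> bool {
    *count -= 1;
    *count == 0
}
```

This mirrors the standard `Arc`-style fencing discipline; the local variants compile to single add/sub instructions.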
- `/tpr-review` passed — independent review found no critical or major issues (or all findings triaged)
- `/impl-hygiene-review` passed — hygiene review clean. MUST run AFTER `/tpr-review` is clean.
- Subsection close-out (09.3) — MANDATORY before starting the next subsection. Run `/improve-tooling` retrospectively on THIS subsection's debugging journey (per `.claude/skills/improve-tooling/SKILL.md` "Per-Subsection Workflow"): which `diagnostics/` scripts you ran, where you added `dbg!`/`tracing` calls, where output was hard to interpret, where test failures gave unhelpful messages, where you ran the same command sequence repeatedly. Forward-look: what tool/log/diagnostic would shorten the next regression in this code path by 10 minutes? Implement improvements NOW (zero deferral) and commit each via a SEPARATE `/commit-push` using a valid conventional-commit type (`build(diagnostics): ...` — surfaced by section-09.3 retrospective; `build`/`test`/`chore`/`ci`/`docs` are valid, `tools(...)` is rejected by the lefthook commit-msg hook). Mandatory even when nothing felt painful. If genuinely no gaps, document briefly: "Retrospective 09.3: no tooling gaps". Update this subsection's `status` in section frontmatter to `complete`.
- `/sync-claude` section-close doc sync — verify Claude artifacts across all section commits. Map changed crates to rules files, check CLAUDE.md, canon.md. Fix drift NOW.
- Repo hygiene check — run `diagnostics/repo-hygiene.sh --check` and clean any detected temp files.
09.4 Completion Checklist
Test matrix for §09 (write failing tests FIRST, verify they fail, then implement):
| Allocation pattern | Expected sharing bound | Expected RC width | Semantic pin |
|---|---|---|---|
| Non-escaping stack-promoted value (§08) | Unique | None (no header) | Yes — zero ori_rc_alloc |
| Local value passed as borrowed param, no loop | Bounded(2) | I8 | Yes — ori_rc_alloc_i8 in IR |
| Value in a loop body (inc inside loop) | Unbounded | I64 | Yes — must NOT use narrow |
| Globally-escaping value (returned) | Unbounded | I64 | Yes — must use standard i64 |
| Value shared with ≤ 127 callers in straight-line code | Bounded(N ≤ 127) | I8 | Yes |
| Recursive function sharing its parameter | Unbounded | I64 | Yes |
- Design the V5-narrow header layout document in `compiler/ori_rt/src/rc/narrow.rs` BEFORE writing any code:
  - WHERE: write a `// SAFETY:` comment block and `static_assert!` macros as specified in §09.3
  - All `unsafe` blocks in `narrow.rs` MUST have `// SAFETY:` comments per hygiene rules
- Add `static_assert!` proving each bounded layout is smaller than the current header (`size_of::<V5HeaderI32>() < 32`, `size_of::<V5HeaderI16>() < 32`, `size_of::<V5HeaderI8>() < 32`) and that payload alignment is preserved
- Sharing bound analysis computes `Unique` for non-escaping allocations
- Sharing bound analysis computes `Bounded(N)` for values with limited sharing
- RC header width matches sharing bound: Unique → none, ≤127 → i8, ≤32K → i16
- Runtime has `ori_rc_alloc_i8`, `ori_rc_inc_i8`, `ori_rc_dec_i8` (and i16, i32) in `compiler/ori_rt/src/rc/narrow.rs`
- Header overflow in debug mode → trap with diagnostic
- Benchmarks and diagnostics report measured per-allocation savings from the real chosen layouts, not from nominal refcount widths
- `./test-all.sh` green
- `./clippy-all.sh` green
- `./diagnostics/valgrind-aot.sh` clean
- The implementation keeps Option A explicit in code comments, layout assertions, and §12 benchmark expectations
- `/tpr-review` passed — independent Codex review found no critical or major issues (or all findings triaged)
- `/impl-hygiene-review` passed — implementation hygiene review clean (phase boundaries, SSOT, algorithmic DRY, naming). MUST run AFTER `/tpr-review` is clean.
- `/improve-tooling` retrospective completed — MANDATORY at section close, after both reviews are clean. Reflect on the section's debugging journey (which `diagnostics/` scripts you ran, which command sequences you repeated, where you added ad-hoc `dbg!`/`tracing` calls, where output was hard to interpret) and identify any tool/log/diagnostic improvement that would have made this section materially easier OR that would help the next section touching this area. Implement every accepted improvement NOW (zero deferral) and commit each via a SEPARATE `/commit-push`. The retrospective is mandatory even when nothing felt painful — that is exactly when blind spots accumulate. See `.claude/skills/improve-tooling/SKILL.md` "Retrospective Mode" for the full protocol.
Exit Criteria: A smaller-than-32-byte header strategy is implemented and verified end-to-end for bounded-count variants. The memory reduction is measured and documented from the real layout chosen in rc/narrow.rs, with payload alignment and drop-path correctness preserved. Valgrind clean.
09.R Third Party Review Findings
- [TPR-09-001][critical] `section-09-arc-header.md:199-206` — Design contradicts exit criteria: narrow headers padded to 32 bytes yield zero memory savings. The plan at line 204 states "ALL narrow header variants remain 32 bytes" with padding to maintain `elem_header.rs` offset compatibility. Lines 173-175 claim "saves 4/6/7 bytes per allocation" and the exit criteria claims "~7MB savings" — both impossible with 32-byte padded headers. The entire section's memory optimization premise collapses. Action: Either (a) redesign the header layout with variable offsets (breaking `elem_header.rs` hardcoded offsets), (b) use Swift-style bitfield encoding in a single 64-bit word, (c) reframe the section around RC elision for `Unique` allocations (which saves the full 32 bytes — already done by §08 stack promotion), or (d) revise the exit criteria to reflect the actual benefit (narrower atomic operations for throughput, not memory reduction). Consensus: 3/3 reviewers, escalated from critical.