Section 09: ARC Header Compression
Context: The current runtime (ori_rt) unconditionally uses i64 for the reference count, with MAX_REFCOUNT = i64::MAX. This is maximally safe but wasteful. The ARC pipeline already computes borrow/ownership information — extending this to bound the maximum refcount is a natural next step.
The challenge: RC header width must be a compile-time decision, but refcount values are runtime quantities. We need static analysis that proves an upper bound on the dynamic refcount.
Reference implementations:
- Swift (stdlib/public/SwiftShims/RefCount.h): encodes refcount + flags in a single 64-bit word using bitfields. Stores strong count, unowned count, and flags (immutable, immortal, deallocating) in one word.
- Lean4 (src/runtime/object.h): uses 32-bit RC + tag in a single word. RC overflow bumps to "immortal" (never freed).
- CPython: uses `Py_ssize_t` (platform word size). Simple, but wastes memory.
Depends on: §02 (triviality — trivial values need no header), §08 (escape analysis — non-escaping values need no header).
09.1 Sharing Bound Analysis
File(s): compiler/ori_repr/src/arc_opt/sharing.rs
Compute an upper bound on the number of simultaneous references to each allocation.
- Define sharing bound:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SharingBound {
    /// Exactly one reference at all times (unique ownership)
    Unique,
    /// At most N references (from call graph analysis)
    Bounded(u32),
    /// Unknown — could be arbitrarily many
    Unbounded,
}
```
- Implement per-allocation sharing analysis:

```rust
pub fn compute_sharing_bound(
    alloc: AllocId,
    arc_func: &ArcFunction,
    escape_info: &EscapeInfo,
    loop_info: &LoopInfo,
) -> SharingBound {
    // If value doesn't escape → Unique (refcount always 1)
    if escape_info.escape_state(alloc) == EscapeState::NoEscape {
        return SharingBound::Unique;
    }

    // Collect all rc_inc instructions for this allocation
    let incs = arc_func.rc_incs_for(alloc);
    let decs = arc_func.rc_decs_for(alloc);

    // SOUNDNESS: Static instruction count ≠ dynamic execution count.
    // A single rc_inc inside a loop body can execute N times, creating
    // up to N simultaneous references. We must bail to Unbounded if
    // any rc_inc is inside a loop or reachable via recursion.
    let any_inc_in_loop = incs.iter().any(|inc| loop_info.is_in_loop(inc.block));
    if any_inc_in_loop {
        // Future: if §03 range analysis can bound the loop iteration
        // count, we could use SharingBound::Bounded(inc_count * max_iters + 1).
        // For now, conservatively bail.
        return SharingBound::Unbounded;
    }

    // Similarly, if this function is (mutually) recursive and the
    // allocation flows into the recursive call, the static count
    // is not a sound bound.
    if arc_func.is_recursive() && arc_func.alloc_flows_to_self_call(alloc) {
        return SharingBound::Unbounded;
    }

    // SOUNDNESS: GlobalEscape means outside code can see this value
    // (returned, stored in global, captured in escaping closure) and may
    // create arbitrarily many additional references. Only ArgEscape
    // (borrowed by callees, never retained) allows a bounded count.
    if escape_info.escape_state(alloc) == EscapeState::GlobalEscape {
        return SharingBound::Unbounded;
    }

    // ArgEscape: value is passed to callees as borrowed only.
    // All rc_inc/dec are in straight-line code (no loops, no recursion).
    // If all inc/dec pairs are within the function → bounded.
    debug_assert_eq!(
        escape_info.escape_state(alloc),
        EscapeState::ArgEscape,
        "only ArgEscape should reach here"
    );
    let inc_count = incs.len();
    let dec_count = decs.len();
    if inc_count == dec_count {
        // All sharing is temporary → max refs = inc_count + 1 (initial)
        return SharingBound::Bounded(inc_count as u32 + 1);
    }
    // Callee retains some references we can't bound
    SharingBound::Unbounded
}
```
Interprocedural refinement:
- If a function's escape summary says a parameter is `ArgEscape` (borrowed), it contributes 0 to the sharing count
- If a value is passed to N functions as borrowed, the sharing bound is still 1
- If a value is cloned into a collection, the sharing bound increases by the collection capacity
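The refinement rules above can be sketched as a fold over callee summaries. This is a minimal illustration, not the planned implementation: `ParamSummary` and `refine_with_callees` are hypothetical names, and `SharingBound` is repeated here so the sketch is self-contained.

```rust
/// Hypothetical per-callee summary of how a passed argument is used.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ParamSummary {
    /// Callee only borrows: contributes 0 extra references.
    Borrowed,
    /// Callee retains exactly `n` references past the call.
    Retains(u32),
    /// Callee may retain arbitrarily many (e.g. stores into a collection
    /// whose capacity we cannot bound).
    Unknown,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SharingBound {
    Unique,
    Bounded(u32),
    Unbounded,
}

/// Fold callee summaries into the intraprocedural bound.
/// `base` is the bound computed from local rc_inc counting.
pub fn refine_with_callees(base: SharingBound, callee_params: &[ParamSummary]) -> SharingBound {
    let mut extra: u32 = 0;
    for p in callee_params {
        match p {
            ParamSummary::Borrowed => {}             // borrowed-only: no contribution
            ParamSummary::Retains(n) => extra += *n, // callee keeps n refs alive
            ParamSummary::Unknown => return SharingBound::Unbounded,
        }
    }
    match base {
        SharingBound::Unique if extra == 0 => SharingBound::Unique,
        SharingBound::Unique => SharingBound::Bounded(1 + extra),
        SharingBound::Bounded(n) => SharingBound::Bounded(n + extra),
        SharingBound::Unbounded => SharingBound::Unbounded,
    }
}
```

Note the asymmetry: any number of `Borrowed` callees leaves the bound untouched, while a single `Unknown` callee forces `Unbounded`.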
09.2 RC Width Selection
File(s): compiler/ori_repr/src/arc_opt/rc_width.rs
Map sharing bounds to RC header widths.
- Implement width selection:

```rust
pub fn select_rc_width(bound: SharingBound) -> Option<IntWidth> {
    match bound {
        SharingBound::Unique => None, // no RC header at all!
        SharingBound::Bounded(n) if n <= 127 => Some(IntWidth::I8),
        SharingBound::Bounded(n) if n <= 32_767 => Some(IntWidth::I16),
        SharingBound::Bounded(n) if n <= 2_147_483_647 => Some(IntWidth::I32),
        _ => Some(IntWidth::I64),
    }
}
```
Overflow behavior:
- If at runtime the refcount exceeds the header width → must NOT silently overflow
- Options:
- (a) Trap (recommended for debug builds): panic on overflow
- (b) Promote to immortal (Lean4 approach): set refcount to MAX → never freed (leaked)
- (c) Widen header (complex): realloc with wider header at runtime
- Recommendation: (a) for debug, (b) for release. The analysis proving the bound is sound, so overflow should never happen in correct programs. The trap catches bugs; the immortal fallback prevents crashes.
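The debug-trap / release-immortal recommendation can be condensed into a single increment helper. This is an illustrative sketch operating on a plain `i8` value (the hypothetical name `rc_inc_i8_policy` is not part of the runtime ABI); the real per-width functions operate through the header pointer.

```rust
/// Sketch of the recommended overflow policy for an i8 header.
/// The analysis proved refcount <= 127, so hitting the ceiling means
/// either an analysis bug (debug: trap) or we prefer leaking to
/// corrupting memory (release: saturate, object becomes immortal).
pub fn rc_inc_i8_policy(rc: i8) -> i8 {
    if rc == i8::MAX {
        if cfg!(debug_assertions) {
            // Option (a): trap with a diagnostic.
            panic!("rc overflow: sharing bound analysis proved <= 127");
        }
        // Option (b): Lean4-style promotion to immortal — the count
        // stays pinned at MAX and the object is never freed.
        return rc;
    }
    rc + 1
}
```

A decrement helper would mirror this: a count pinned at `i8::MAX` is never decremented, so immortal objects are never freed.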
- Generate per-width runtime functions (ABI matches existing i64 functions):

```rust
// ori_rt additions — same 2-arg dec signature as ori_rc_dec:
extern "C" fn ori_rc_alloc_i8(size: usize, align: usize) -> *mut u8;
extern "C" fn ori_rc_inc_i8(data_ptr: *mut u8);
extern "C" fn ori_rc_dec_i8(data_ptr: *mut u8, drop_fn: Option<...>);
extern "C" fn ori_rc_free_i8(data_ptr: *mut u8, size: usize, align: usize);
// Similar for i16, i32.
// drop_fn calls ori_rc_free_$suffix (generated by DropFunctionGenerator).
```
09.3 Runtime Polymorphism
File(s): compiler/ori_rt/src/rc/narrow.rs (new file inside rc/ module)
The runtime must support multiple header widths without code bloat.
Module placement: The width-specific functions MUST live inside rc/ (e.g., rc/narrow.rs with mod narrow; in rc/mod.rs). This is required because they call call_drop_fn and rc_underflow_abort, which are pub(super) — visible within rc/ but not from lib.rs or other modules. Tests go in rc/narrow/tests.rs (sibling convention) if the file becomes a directory module, or in rc/tests.rs if narrow.rs stays as a leaf file and tests are co-located with the existing rc/ test module.
Risk warning: The macro-generated RC operations below use raw pointer arithmetic and unsafe. Every unsafe block MUST have a // SAFETY: comment. The padded_header alignment logic is subtle — a bug causes data corruption in EVERY narrow-header allocation. Property-based testing with varying (size, align) pairs is essential. Note: ori_rt is a crate where unsafe IS allowed, so #![deny(unsafe_code)] does NOT apply here.
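The alignment invariant behind `padded_header` can be property-tested in isolation, without touching the allocator. The helper names below (`padded_header`, `header_layout_ok`) are illustrative re-derivations of the arithmetic used in the alloc/free sketch, not existing runtime functions.

```rust
/// The padded-header computation: the header is padded up to the
/// payload's alignment so that base + padded_header is payload-aligned.
pub fn padded_header(header_size: usize, payload_align: usize) -> usize {
    header_size.max(payload_align)
}

/// Invariant to property-test: given power-of-two inputs, the padding
/// (1) is a multiple of the payload alignment, so the payload pointer
///     stays aligned when the base is payload-aligned, and
/// (2) leaves room for the header immediately before the payload.
pub fn header_layout_ok(header_size: usize, payload_align: usize) -> bool {
    let pad = padded_header(header_size, payload_align);
    pad % payload_align == 0 && pad >= header_size
}
```

A real property test would additionally allocate, write the header via `.sub(1)`, write the payload, and check both addresses with `align_offset`; this sketch only covers the arithmetic.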
- Implement generic RC operations parameterized by header width (note: `paste!` identifier concatenation uses the `[<...>]` syntax from the `paste` crate):

```rust
use std::alloc::{alloc, dealloc, Layout};
use paste::paste;

// Macro-generated for each width
macro_rules! rc_ops {
    ($width:ty, $suffix:ident, $max:expr) => {
        paste! {
            #[no_mangle]
            pub unsafe extern "C" fn [<ori_rc_alloc_ $suffix>](
                size: usize,
                align: usize,
            ) -> *mut u8 {
                let header_size = std::mem::size_of::<$width>();
                // Pad header up to payload alignment so that
                // base + padded_header is correctly aligned for the payload.
                // Without this, narrow headers (i8/i16) would misalign
                // payloads requiring 4- or 8-byte alignment.
                let padded_header = header_size.max(align);
                let total = size + padded_header;
                let layout = Layout::from_size_align(
                    total,
                    align.max(std::mem::align_of::<$width>()),
                ).unwrap();
                let base = alloc(layout);
                let data_ptr = base.add(padded_header);
                // Store refcount immediately before the payload.
                // This is where rc_inc/rc_dec will find it via .sub(1).
                let rc_ptr = (data_ptr as *mut $width).sub(1);
                *rc_ptr = 1;
                data_ptr
            }

            #[no_mangle]
            pub unsafe extern "C" fn [<ori_rc_inc_ $suffix>](data_ptr: *mut u8) {
                if data_ptr.is_null() { return; }
                let rc_ptr = (data_ptr as *mut $width).sub(1);
                let old = *rc_ptr;
                if old >= $max {
                    // Promote to immortal (never freed)
                    return;
                }
                *rc_ptr = old + 1;
            }

            /// Decrement refcount. On zero, call drop_fn which is responsible
            /// for decrementing child RC fields AND calling ori_rc_free_$suffix.
            ///
            /// CONTRACT (matches existing ori_rc_dec):
            ///   drop_fn does: (1) rc_dec each RC'd child field
            ///                 (2) ori_rc_free_$suffix(data_ptr, size, align)
            ///   rc_dec does NOT call free — that would be a double-free.
            #[no_mangle]
            pub unsafe extern "C" fn [<ori_rc_dec_ $suffix>](
                data_ptr: *mut u8,
                drop_fn: Option<extern "C" fn(*mut u8)>,
            ) {
                if data_ptr.is_null() { return; }
                let rc_ptr = (data_ptr as *mut $width).sub(1);
                let old = *rc_ptr;
                // Underflow protection — matches ori_rc_dec (rc/mod.rs).
                // NOT gated behind debug: one branch per dec (~0.5ns, always
                // not-taken). Catches double-free bugs before they corrupt memory.
                if old <= 0 {
                    rc_underflow_abort(data_ptr);
                }
                if old >= $max {
                    return; // immortal — never freed
                }
                let new = old - 1;
                *rc_ptr = new;
                if new == 0 {
                    if let Some(f) = drop_fn {
                        // Route through call_drop_fn (rc/mod.rs) — a
                        // catch_unwind wrapper that aborts on panic.
                        // ori_rc_dec_$suffix is declared nounwind in LLVM
                        // IR, so unwinding through it is UB. This guard
                        // matches the existing ori_rc_dec contract.
                        call_drop_fn(f, data_ptr);
                    }
                    // drop_fn calls ori_rc_free_$suffix internally.
                    // If drop_fn is None, the memory leaks — this should
                    // not happen in well-formed programs (every RC type
                    // must have a drop function).
                }
            }

            /// Free an RC allocation. Called by drop functions (generated by
            /// DropFunctionGenerator), NOT by ori_rc_dec directly.
            ///
            /// Reconstructs the original allocation base by reversing the
            /// padded_header computation from ori_rc_alloc_$suffix.
            #[no_mangle]
            pub unsafe extern "C" fn [<ori_rc_free_ $suffix>](
                data_ptr: *mut u8,
                size: usize,
                align: usize,
            ) {
                let header_size = std::mem::size_of::<$width>();
                let padded_header = header_size.max(align);
                let base = data_ptr.sub(padded_header);
                let total = size + padded_header;
                let layout = Layout::from_size_align_unchecked(
                    total,
                    align.max(std::mem::align_of::<$width>()),
                );
                dealloc(base, layout);
            }
        }
    };
}

rc_ops!(i8, i8, i8::MAX);
rc_ops!(i16, i16, i16::MAX);
rc_ops!(i32, i32, i32::MAX);
// i64 is the existing implementation
```
Atomic variants:
- For thread-shared values (§10 determines this), use atomic operations
- For thread-local values, use plain loads/stores (much faster)
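The thread-shared vs. thread-local split can be illustrated with a pair of increment sketches. These are not the planned `ori_rt` entry points (those go through the raw header pointer); `atomic_inc_i8` and `plain_inc_i8` are illustrative names showing only the core difference in memory operations.

```rust
use std::sync::atomic::{AtomicI8, Ordering};

/// Thread-shared: CAS loop so the immortal ceiling is respected even
/// under contention. Relaxed ordering suffices for increments — taking
/// a new reference needs no synchronization (decrement is the hard
/// case, which, as in std::sync::Arc, needs Release plus an Acquire
/// fence before running the destructor).
pub fn atomic_inc_i8(rc: &AtomicI8) {
    let mut cur = rc.load(Ordering::Relaxed);
    loop {
        if cur == i8::MAX {
            return; // immortal — pinned at MAX, never freed
        }
        match rc.compare_exchange_weak(cur, cur + 1, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(_) => return,
            Err(actual) => cur = actual,
        }
    }
}

/// Thread-local: a plain load/store pair — no lock prefix, no CAS loop,
/// typically a single-cycle add on the fast path.
pub fn plain_inc_i8(rc: &mut i8) {
    if *rc != i8::MAX {
        *rc += 1;
    }
}
```

The two variants must share the header layout so §10's thread-sharing analysis only selects which function the compiler emits a call to, not how the object is laid out.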
09.4 Completion Checklist
- Sharing bound analysis computes `Unique` for non-escaping allocations
- Sharing bound analysis computes `Bounded(N)` for values with limited sharing
- RC header width matches sharing bound: `Unique` → none, ≤127 → i8, ≤32K → i16
- Runtime has `ori_rc_alloc_i8`, `ori_rc_inc_i8`, `ori_rc_dec_i8` (and i16, i32 variants)
- Header overflow in release mode → immortal (never freed, no crash)
- Header overflow in debug mode → trap with diagnostic
- `./test-all.sh` green
- `./clippy-all.sh` green
- `./diagnostics/valgrind-aot.sh` clean
- Memory measurement: per-object overhead reduced from 8 bytes to 1-4 bytes for bounded types
Exit Criteria: A program that creates 1M small heap objects with refcount ≤ 2 uses ori_rc_alloc_i8 (1-byte header), verified by grep "ori_rc_alloc_i8" in LLVM IR. Total memory reduced by ~7MB vs. current implementation. Valgrind clean.