Section 10: Thread-Local Non-Atomic ARC
Context: Currently, ori_rt uses AtomicI64 with Relaxed/Release/Acquire ordering for all RC operations. This is correct for thread-shared values but wasteful for thread-local ones. Most values in most programs never cross thread boundaries — they’re created, used, and freed within a single thread.
Rust solved this by having two types: Rc (non-atomic, thread-local) and Arc (atomic, thread-safe). Ori doesn’t expose this distinction to the programmer — the compiler decides automatically.
Reference implementations:
- Rust: library/alloc/src/rc.rs vs library/alloc/src/sync.rs: Rc uses Cell<usize> (non-atomic), Arc uses AtomicUsize. Programmer chooses.
- Swift: All RC is atomic by default, but isKnownUniquelyReferenced() enables COW without RC overhead. No automatic non-atomic promotion.
- CPython: GIL-protected; all RC is effectively non-atomic because only one thread runs at a time.
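For readers less familiar with the Rust split listed above, here is a minimal sketch of the two types side by side. It uses only the Rust standard library and is not Ori code; the function names are made up for illustration.

```rust
use std::rc::Rc;
use std::sync::Arc;
use std::thread;

/// Clone an `Rc` twice on one thread. The count is a plain, non-atomic cell.
fn rc_count_after_clones() -> usize {
    let local = Rc::new(vec![1, 2, 3]);
    let _a = Rc::clone(&local);
    let _b = Rc::clone(&local);
    // `local`, `_a`, and `_b` are all still alive here.
    Rc::strong_count(&local)
}

/// Share an `Arc` with a spawned thread. The count is atomic.
fn arc_sum_across_thread() -> i32 {
    let shared = Arc::new(vec![1, 2, 3]);
    let for_thread = Arc::clone(&shared);
    // An `Rc` would be rejected by the compiler here because it is not `Send`.
    let handle = thread::spawn(move || for_thread.iter().sum::<i32>());
    handle.join().expect("worker thread panicked")
}
```

In Rust the programmer makes this choice per value; the goal of this section is for the Ori compiler to make the equivalent choice automatically.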
Depends on: §08 (escape analysis to determine thread-locality), §09 (header compression — non-atomic and atomic headers differ).
10.1 Thread Escape Analysis
File(s): compiler/ori_repr/src/escape/thread.rs
Extend escape analysis (§08) to track thread boundaries.
- Define thread escape:
  ```rust
  #[derive(Debug, Clone, Copy, PartialEq, Eq)]
  pub enum ThreadLocality {
      /// Value never crosses a thread boundary
      ThreadLocal,
      /// Value may be shared across threads
      ThreadShared,
      /// Unknown (conservative: treat as ThreadShared)
      Unknown,
  }
  ```
- Identify thread boundary operations:
  - spawn(): values captured by the spawned closure cross threads
  - chan.send(value): value crosses thread via channel
  - Global mutable state (if Ori adds it): shared by all threads
  - FFI calls with unknown thread behavior → conservative (ThreadShared)
- Propagate thread-locality:
  ```rust
  pub fn analyze_thread_locality(
      func: &ArcFunction,
      escape_info: &EscapeInfo,
      pool: &Pool,
  ) -> FxHashMap<AllocId, ThreadLocality> {
      let mut locality = FxHashMap::default();

      for alloc in func.allocations() {
          if escape_info.escape_state(alloc) == EscapeState::NoEscape {
              // Non-escaping → definitely thread-local
              locality.insert(alloc, ThreadLocality::ThreadLocal);
              continue;
          }

          // Check if any escape path crosses a thread boundary
          let crosses_thread = escape_info.escape_paths(alloc)
              .any(|path| path.crosses_thread_boundary());

          locality.insert(alloc, if crosses_thread {
              ThreadLocality::ThreadShared
          } else {
              ThreadLocality::ThreadLocal
          });
      }

      locality
  }
  ```
- Whole-program optimization:
  - If the program has NO spawn() calls and NO channel operations → ALL values are ThreadLocal
  - This is detectable with a simple call graph scan
  - Enables ALL RC operations to be non-atomic for single-threaded programs
- /tpr-review passed: independent review found no critical or major issues (or all findings triaged)
- /impl-hygiene-review passed: hygiene review clean. MUST run AFTER /tpr-review is clean.
- Subsection close-out (10.1): MANDATORY before starting the next subsection. Run /improve-tooling retrospectively on THIS subsection’s debugging journey (per .claude/skills/improve-tooling/SKILL.md “Per-Subsection Workflow”): which diagnostics/ scripts you ran, where you added dbg!/tracing calls, where output was hard to interpret, where test failures gave unhelpful messages, where you ran the same command sequence repeatedly. Forward-look: what tool/log/diagnostic would shorten the next regression in this code path by 10 minutes? Implement improvements NOW (zero deferral) and commit each via SEPARATE /commit-push using a valid conventional-commit type (build(diagnostics): ... — surfaced by section-10.1 retrospective; build/test/chore/ci/docs are valid, tools(...) is rejected by the lefthook commit-msg hook). Mandatory even when nothing felt painful. If genuinely no gaps, document briefly: “Retrospective 10.1: no tooling gaps”. Update this subsection’s status in section frontmatter to complete.
- /sync-claude section-close doc sync: verify Claude artifacts across all section commits. Map changed crates to rules files, check CLAUDE.md, canon.md. Fix drift NOW.
- Repo hygiene check: run diagnostics/repo-hygiene.sh --check and clean any detected temp files.
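The whole-program check in 10.1 ("a simple call graph scan") could look roughly like the sketch below. The call graph representation, the operation names in THREAD_BOUNDARY_OPS, and the function name are all hypothetical stand-ins, not the ori_repr data structures.

```rust
use std::collections::HashMap;

/// Hypothetical call graph: function name -> names of everything it calls.
type CallGraph = HashMap<&'static str, Vec<&'static str>>;

/// Hypothetical names for the thread-boundary operations listed in 10.1.
/// A real scan would also have to treat FFI calls with unknown thread
/// behavior as boundaries.
const THREAD_BOUNDARY_OPS: [&str; 2] = ["spawn", "chan.send"];

/// Returns true when no call site anywhere in the program targets a
/// thread-boundary operation, so every value can be treated as ThreadLocal.
fn program_is_single_threaded(graph: &CallGraph) -> bool {
    graph
        .values()
        .flatten()
        .all(|callee| !THREAD_BOUNDARY_OPS.contains(callee))
}
```

The scan is deliberately coarse: one spawn or channel call anywhere disables the whole-program shortcut, and the per-allocation analysis takes over.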
10.2 Non-Atomic RC Runtime
File(s): compiler/ori_rt/src/rc/nonatomic.rs (new file inside rc/ module)
Module placement: Must live inside rc/ (e.g., rc/nonatomic.rs with mod nonatomic; in rc/mod.rs) to access call_drop_fn and rc_underflow_abort which are pub(super). Note: ori_rt allows unsafe (it is NOT in the #![deny(unsafe_code)] list). Every unsafe block MUST have a // SAFETY: comment.
Risk warning: Non-atomic RC on a value that IS actually shared across threads causes data races (undefined behavior). The soundness of this entire section depends on §08’s escape analysis and §10.1’s thread escape analysis being correct. If the analysis is unsound, this creates UB that Valgrind (memcheck) will not catch — only helgrind/TSAN will. This section must be gated on §08 being fully verified first.
- Add a debug-only RC mode guard before exposing any non-atomic runtime entry point:
  - Store an RcMode flag (Atomic vs NonAtomic) in debug builds only, either in a side table or a debug-only header word
  - ori_rc_inc / ori_rc_dec assert they are not touching an allocation marked non-atomic
  - ori_rc_inc_nonatomic / ori_rc_dec_nonatomic assert they are not touching an allocation marked atomic
  - Release builds pay zero cost for this guard
  - This guard is mandatory; helgrind is a secondary verifier, not the only safety net
- Implement non-atomic RC operations:
  ```rust
  #[no_mangle]
  pub unsafe extern "C" fn ori_rc_inc_nonatomic(data_ptr: *mut u8) {
      if data_ptr.is_null() { return; }
      // SAFETY: data_ptr was returned by ori_rc_alloc; strong_count is at data_ptr - 8.
      let rc_ptr = data_ptr.sub(8).cast::<i64>();
      let count = *rc_ptr; // plain load (no atomic)
      if count >= MAX_REFCOUNT { std::process::abort(); }
      *rc_ptr = count + 1; // plain store (no atomic)
  }

  #[no_mangle]
  pub unsafe extern "C" fn ori_rc_dec_nonatomic(
      data_ptr: *mut u8,
      drop_fn: Option<extern "C" fn(*mut u8)>,
  ) {
      if data_ptr.is_null() { return; }
      // SAFETY: data_ptr was returned by ori_rc_alloc; strong_count is at data_ptr - 8.
      let rc_ptr = data_ptr.sub(8).cast::<i64>();
      let count = *rc_ptr; // plain load (no atomic)
      // Underflow protection, matching ori_rc_dec (rc/mod.rs).
      // Always-on, not debug-only. Catches double-free bugs.
      if count <= 0 { rc_underflow_abort(data_ptr); }
      *rc_ptr = count - 1; // plain store (no atomic)
      if count == 1 {
          // Last reference: drop via abort-on-panic guard.
          // ori_rc_dec_nonatomic is nounwind; unwinding through it is UB.
          if let Some(f) = drop_fn { call_drop_fn(f, data_ptr); }
      }
  }
  ```
- Also provide width-specific non-atomic variants:
  - ori_rc_inc_nonatomic_i8, ori_rc_dec_nonatomic_i8
  - ori_rc_inc_nonatomic_i16, ori_rc_dec_nonatomic_i16
  - Combines with §09 header compression
- LLVM codegen selects atomic vs. non-atomic based on ReprPlan::rc_strategy():
  ```rust
  match repr_plan.rc_strategy(type_idx) {
      RcStrategy::None => { /* skip RC */ }
      RcStrategy::Atomic { width } => {
          // Call ori_rc_inc_$width / ori_rc_dec_$width (§09 width-suffixed)
          // ABI: inc(data_ptr), dec(data_ptr, drop_fn); matches existing contract
          emit_atomic_rc(width);
      }
      RcStrategy::NonAtomic { width } => {
          // Call ori_rc_inc_nonatomic_$width / ori_rc_dec_nonatomic_$width
          // Same 2-arg dec ABI: dec(data_ptr, drop_fn)
          emit_nonatomic_rc(width);
      }
  }
  ```
- /tpr-review passed: independent review found no critical or major issues (or all findings triaged)
- /impl-hygiene-review passed: hygiene review clean. MUST run AFTER /tpr-review is clean.
- Subsection close-out (10.2): MANDATORY before starting the next subsection. Run /improve-tooling retrospectively on THIS subsection’s debugging journey (per .claude/skills/improve-tooling/SKILL.md “Per-Subsection Workflow”): which diagnostics/ scripts you ran, where you added dbg!/tracing calls, where output was hard to interpret, where test failures gave unhelpful messages, where you ran the same command sequence repeatedly. Forward-look: what tool/log/diagnostic would shorten the next regression in this code path by 10 minutes? Implement improvements NOW (zero deferral) and commit each via SEPARATE /commit-push using a valid conventional-commit type (build(diagnostics): ... — surfaced by section-10.2 retrospective; build/test/chore/ci/docs are valid, tools(...) is rejected by the lefthook commit-msg hook). Mandatory even when nothing felt painful. If genuinely no gaps, document briefly: “Retrospective 10.2: no tooling gaps”. Update this subsection’s status in section frontmatter to complete.
- /sync-claude section-close doc sync: verify Claude artifacts across all section commits. Map changed crates to rules files, check CLAUDE.md, canon.md. Fix drift NOW.
- Repo hygiene check: run diagnostics/repo-hygiene.sh --check and clean any detected temp files.
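One possible shape for the side-table variant of the debug-only RC mode guard in 10.2 is sketched below. RcModeTable and its methods are hypothetical names. In ori_rt this logic would presumably sit behind #[cfg(debug_assertions)], abort instead of returning an error, and need its own synchronization, since the atomic entry points can run on any thread; the sketch leaves all of that out to stay small.

```rust
use std::collections::HashMap;

/// RC mode recorded per allocation (mirrors the RcMode flag described in 10.2).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RcMode {
    Atomic,
    NonAtomic,
}

/// Hypothetical side table keyed by allocation address.
#[derive(Default)]
struct RcModeTable {
    modes: HashMap<usize, RcMode>,
}

impl RcModeTable {
    /// Called at allocation time with the mode the compiler selected.
    fn record(&mut self, addr: usize, mode: RcMode) {
        self.modes.insert(addr, mode);
    }

    /// Called when the allocation is freed so stale entries do not linger.
    fn forget(&mut self, addr: usize) {
        self.modes.remove(&addr);
    }

    /// Called at the top of each RC entry point with that entry point's mode.
    /// Untracked allocations are accepted; only a recorded mismatch is an error.
    fn check(&self, addr: usize, entry_point_mode: RcMode) -> Result<(), String> {
        match self.modes.get(&addr) {
            Some(&recorded) if recorded != entry_point_mode => Err(format!(
                "RC mode mismatch at {addr:#x}: allocation is {recorded:?}, entry point is {entry_point_mode:?}"
            )),
            _ => Ok(()),
        }
    }
}
```

A debug-only header word would avoid the table lookup but changes the header layout, which interacts with §09; the side table keeps release and debug layouts identical.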
10.3 Migration Fence
File(s): compiler/ori_repr/src/arc_opt/migration.rs
If a value transitions from thread-local to thread-shared (e.g., sent on a channel), the non-atomic refcount must be migrated to atomic.
- Design decision: static vs. dynamic migration
(a) Static (recommended): The compiler proves at compile time that a value is either always thread-local or always thread-shared. No runtime migration needed. If uncertain → atomic.
(b) Dynamic: Store a flag in the header indicating atomic/non-atomic. When a value crosses a thread boundary, flip the flag and issue a memory fence. Adds 1 bit of overhead + branching on every RC operation.
Recommendation: Option (a) for initial implementation. It’s simpler, has zero runtime overhead, and covers the vast majority of cases. Option (b) is only needed if analysis misses important cases (measure first).
- If using static migration:
  - At channel send: if value is marked non-atomic → compile error or automatic promotion to atomic at compile time
  - At spawn: closure captures analyzed → all captured values promoted to atomic if needed
  - The promotion happens at compile time, not runtime
- /tpr-review passed: independent review found no critical or major issues (or all findings triaged)
- /impl-hygiene-review passed: hygiene review clean. MUST run AFTER /tpr-review is clean.
- Subsection close-out (10.3): MANDATORY before starting the next subsection. Run /improve-tooling retrospectively on THIS subsection’s debugging journey (per .claude/skills/improve-tooling/SKILL.md “Per-Subsection Workflow”): which diagnostics/ scripts you ran, where you added dbg!/tracing calls, where output was hard to interpret, where test failures gave unhelpful messages, where you ran the same command sequence repeatedly. Forward-look: what tool/log/diagnostic would shorten the next regression in this code path by 10 minutes? Implement improvements NOW (zero deferral) and commit each via SEPARATE /commit-push using a valid conventional-commit type (build(diagnostics): ... — surfaced by section-10.3 retrospective; build/test/chore/ci/docs are valid, tools(...) is rejected by the lefthook commit-msg hook). Mandatory even when nothing felt painful. If genuinely no gaps, document briefly: “Retrospective 10.3: no tooling gaps”. Update this subsection’s status in section frontmatter to complete.
- /sync-claude section-close doc sync: verify Claude artifacts across all section commits. Map changed crates to rules files, check CLAUDE.md, canon.md. Fix drift NOW.
- Repo hygiene check: run diagnostics/repo-hygiene.sh --check and clean any detected temp files.
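Under option (a), the static decision reduces to a small compile-time mapping. The sketch below assumes a ThreadLocality enum shaped like the one in 10.1; AllocId as a u32, RcChoice, and both function names are hypothetical and only illustrate the "if uncertain → atomic" rule and promotion at spawn/send sites.

```rust
use std::collections::HashMap;

/// Hypothetical allocation identifier.
type AllocId = u32;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ThreadLocality {
    ThreadLocal,
    ThreadShared,
    Unknown,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RcChoice {
    Atomic,
    NonAtomic,
}

/// Only a proven thread-local value gets non-atomic RC; Unknown falls back to atomic.
fn choose_rc(locality: ThreadLocality) -> RcChoice {
    match locality {
        ThreadLocality::ThreadLocal => RcChoice::NonAtomic,
        ThreadLocality::ThreadShared | ThreadLocality::Unknown => RcChoice::Atomic,
    }
}

/// `crossing` lists allocations captured by a spawn closure or sent on a channel.
/// Those are promoted to atomic at compile time regardless of their recorded locality.
fn plan_rc(
    locality: &HashMap<AllocId, ThreadLocality>,
    crossing: &[AllocId],
) -> HashMap<AllocId, RcChoice> {
    locality
        .iter()
        .map(|(&id, &loc)| {
            let choice = if crossing.contains(&id) {
                RcChoice::Atomic
            } else {
                choose_rc(loc)
            };
            (id, choice)
        })
        .collect()
}
```

Because the choice is fixed per allocation before codegen, no header flag or runtime fence is involved, which is the main argument for option (a).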
10.4 Completion Checklist
Test matrix for §10 (write failing tests FIRST, verify they fail, then implement):
| Program pattern | Expected RC variant | Semantic pin |
|---|---|---|
| Single-threaded program with many RC operations | All ori_rc_inc_nonatomic / ori_rc_dec_nonatomic calls | Yes: zero ori_rc_inc (atomic) in LLVM IR |
| Single-threaded program: no spawn(), no channel() calls | ALL values use non-atomic RC (whole-program optimization) | Yes: zero atomic RC ops in IR |
| Multi-threaded: spawn() captures a list | Captured list uses atomic RC; local list in spawned fn uses non-atomic | Yes: split atomic/non-atomic |
| chan.send(value): value crosses channel | value promoted to atomic RC before send | Yes: ori_rc_inc (atomic) before send |
| Value created AFTER spawn() in spawned closure | Non-atomic (thread-local to that thread) | Yes: post-spawn local stays non-atomic |
| Width-specific non-atomic: bounded value + thread-local | ori_rc_inc_nonatomic_i8 / ori_rc_dec_nonatomic_i8 | Yes: narrow + non-atomic combined |
| Non-atomic RC correct behavior: single-thread dec to 0 | Value dropped correctly (same semantics as atomic) | Yes: correctness equivalence |
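The "zero atomic RC ops in IR" pins in the matrix above could be checked by scanning the textual LLVM IR for RC call sites. This is a sketch only: count_rc_calls is a hypothetical helper, and it assumes that every non-atomic runtime symbol contains _nonatomic and that every other @ori_rc_inc* / @ori_rc_dec* symbol is atomic.

```rust
/// Counts call sites of RC runtime functions in textual LLVM IR.
/// Returns (atomic_calls, nonatomic_calls).
fn count_rc_calls(ir: &str) -> (usize, usize) {
    let mut atomic = 0;
    let mut nonatomic = 0;
    for line in ir.lines() {
        // Skip `declare` lines and anything that is not a call instruction.
        let is_call = line.contains("call ");
        let is_rc_op = line.contains("@ori_rc_inc") || line.contains("@ori_rc_dec");
        if !(is_call && is_rc_op) {
            continue;
        }
        if line.contains("_nonatomic") {
            nonatomic += 1;
        } else {
            atomic += 1;
        }
    }
    (atomic, nonatomic)
}
```

A pin test for a single-threaded program would then assert that the first component is 0 and the second is greater than 0, so the test cannot pass simply because RC was elided entirely.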
- Write failing test matrix BEFORE implementation (verify tests fail with current all-atomic codegen)
- Single-threaded programs: ALL RC operations use ori_rc_*_nonatomic variants
- Multi-threaded programs: only thread-shared values use atomic RC
- Channel sends correctly mark values as thread-shared
- Spawn captures correctly mark captured values as thread-shared
- Width-specific non-atomic variants: ori_rc_inc_nonatomic_i8, ori_rc_dec_nonatomic_i8, ori_rc_inc_nonatomic_i16, ori_rc_dec_nonatomic_i16 (combines with §09)
- Add semantic pin test: a single-threaded program produces ZERO atomic RC operations in LLVM IR (all ops are ori_rc_*_nonatomic). This test can ONLY pass with thread-local analysis enabled.
- Debug builds assert on RC-mode mismatches (atomic API used on non-atomic allocation or vice versa)
- Non-atomic RC operations are measurably faster (benchmark ≥ 20% improvement in RC-heavy workloads)
- ./diagnostics/dual-exec-verify.sh passes: non-atomic RC produces identical behavior to atomic RC
- Extend diagnostics/valgrind-aot.sh to accept an optional --tool=helgrind passthrough flag:
  - Add --helgrind flag to the script: when present, pass --tool=helgrind --fair-sched=yes to valgrind instead of --tool=memcheck
  - This is a concrete shell script change, not “invoke manually”
  - File: diagnostics/valgrind-aot.sh (modify the valgrind invocation line)
- Run helgrind on AOT binaries compiled from multi-threaded Ori programs (channel + spawn patterns): ./diagnostics/valgrind-aot.sh --helgrind tests/valgrind/threads/
- Create tests/valgrind/threads/ directory with at minimum:
  - thread_local_only.ori: single-threaded program with many RC operations → no helgrind races
  - channel_send.ori: program that sends values through a channel → helgrind must find no races
- ./test-all.sh green
- ./clippy-all.sh green
- ./diagnostics/dual-exec-verify.sh passes
- /tpr-review passed: independent Codex review found no critical or major issues (or all findings triaged)
- /impl-hygiene-review passed: implementation hygiene review clean (phase boundaries, SSOT, algorithmic DRY, naming). MUST run AFTER /tpr-review is clean.
- /improve-tooling retrospective completed: MANDATORY at section close, after both reviews are clean. Reflect on the section’s debugging journey (which diagnostics/ scripts you ran, which command sequences you repeated, where you added ad-hoc dbg!/tracing calls, where output was hard to interpret) and identify any tool/log/diagnostic improvement that would have made this section materially easier OR that would help the next section touching this area. Implement every accepted improvement NOW (zero deferral) and commit each via SEPARATE /commit-push. The retrospective is mandatory even when nothing felt painful; that is exactly when blind spots accumulate. See .claude/skills/improve-tooling/SKILL.md “Retrospective Mode” for the full protocol.
Exit Criteria: A single-threaded benchmark program shows 0 atomic RC operations in LLVM IR (all ori_rc_*_nonatomic). Performance benchmark shows ≥20% improvement in RC-heavy workloads vs. atomic-only baseline.
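As a rough illustration of what the ≥20% criterion compares, the sketch below times an inc/dec pair on an atomic counter against the same pair on a plain cell. It is a standalone micro-benchmark using only the Rust standard library, not the project's benchmark harness; the function names are hypothetical, and the real measurement should come from RC-heavy Ori programs compiled both ways.

```rust
use std::cell::Cell;
use std::hint::black_box;
use std::sync::atomic::{AtomicI64, Ordering};
use std::time::{Duration, Instant};

/// Inc/dec pairs on an atomic counter, roughly the shape of the current RC path.
fn time_atomic(iterations: u64) -> (i64, Duration) {
    let count = AtomicI64::new(1);
    let start = Instant::now();
    for _ in 0..iterations {
        // black_box keeps the optimizer from collapsing the loop.
        black_box(&count).fetch_add(1, Ordering::Relaxed);
        black_box(&count).fetch_sub(1, Ordering::Release);
    }
    (count.load(Ordering::Acquire), start.elapsed())
}

/// The same inc/dec pairs with plain loads and stores.
fn time_nonatomic(iterations: u64) -> (i64, Duration) {
    let count = Cell::new(1i64);
    let start = Instant::now();
    for _ in 0..iterations {
        let c = black_box(&count);
        c.set(c.get() + 1);
        let c = black_box(&count);
        c.set(c.get() - 1);
    }
    (count.get(), start.elapsed())
}
```

Both functions return the final count alongside the elapsed time so a caller can confirm the two loops did equivalent work before comparing durations. Build with optimizations when timing.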
10.R Third Party Review Findings
- [TPR-10-001][major] section-10-thread-local-arc.md:117-148: Non-atomic RC has no debug-mode safety net; analysis wrong → silent UB. The ori_rc_inc_nonatomic / ori_rc_dec_nonatomic functions use plain loads/stores (lines 120, 124: *rc_ptr). If thread escape analysis (§08 + §10.1) is unsound for any value, concurrent access produces data race UB. The plan acknowledges this risk (line 111) but proposes no runtime fallback, only helgrind testing as a detection tool. No mechanism exists to verify at runtime that a value classified as thread-local is actually single-threaded. Action: Add a debug-mode #[cfg(debug_assertions)] per-allocation flag that records the RC mode (atomic vs non-atomic). Assert on mismatched access (e.g., ori_rc_inc called on an allocation marked non-atomic). Zero cost in release builds. This catches analysis bugs during development before they become silent data races in production. Consensus: 3/3 reviewers.